1. 13 Apr, 2009 2 commits
  2. 03 Apr, 2009 1 commit
  3. 31 Mar, 2009 1 commit
    • SPARC NCG: HpLim is now always stored on the stack, not in a register · 456dc6d6
      Ben.Lippmeier@anu.edu.au authored
         This fixes the out-of-memory errors we were getting on SPARC
         after the following patch:
      
           Fri Mar 13 03:45:16 PDT 2009  Simon Marlow <marlowsd@gmail.com>
           * Instead of a separate context-switch flag, set HpLim to zero
           This reduces the latency between a context-switch being triggered and
           the thread returning to the scheduler, which in turn should reduce the
           cost of the GC barrier when there are many cores. 
  4. 18 Mar, 2009 1 commit
  5. 19 Mar, 2009 1 commit
  6. 17 Mar, 2009 3 commits
    • Add fast event logging · 8b18faef
      Simon Marlow authored
      Generate binary log files from the RTS containing a log of runtime
      events with timestamps.  The log file can be visualised in various
      ways, for investigating runtime behaviour and debugging performance
      problems.  See for example the forthcoming ThreadScope viewer.
      
      New GHC option:
      
        -eventlog   (link-time option) Enables event logging.
      
        +RTS -l     (runtime option) Generates <prog>.eventlog with
                    the binary event information.
      
      This replaces some of the tracing machinery we already had in the RTS:
      e.g. +RTS -vg  for GC tracing (we should do this using the new event
      logging instead).
      
      Event logging has almost no runtime cost when it isn't enabled,
      though in the future we might add more fine-grained events and this
      might change; hence having a link-time option and compiling a
      separate version of the RTS for event logging.  There's a small
      runtime cost when event logging is enabled, but for most programs
      it shouldn't make much difference.
      
      (Todo: docs)
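
      As a quick illustration (the program below is mine, not from the
      patch; only -eventlog and +RTS -l come from this commit), any
      program linked with -eventlog can produce a log:

        -- EventDemo.hs
        -- build with:  ghc -eventlog EventDemo.hs
        -- run with:    ./EventDemo +RTS -l
        -- which should leave a binary EventDemo.eventlog next to the binary.
        module Main where

        import Control.Concurrent (forkIO, threadDelay)

        main :: IO ()
        main = do
          _ <- forkIO (threadDelay 100000)  -- a second thread, so the log
          threadDelay 200000                -- records some scheduling events
          putStrLn "done"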
    • FIX biographical profiling (#3039, probably #2297) · f8f4cb3f
      Simon Marlow authored
      Since we introduced pointer tagging, we no longer always enter a
      closure to evaluate it.  However, the biographical profiler relies on
      closures being entered in order to mark them as "used", so we were
      getting spurious amounts of data attributed to VOID.  It turns out
      there are various places that need to be fixed, and I think at least
      one of them was also wrong before pointer tagging (CgCon.cgReturnDataCon).
    • Add getNumberOfProcessors(), FIX MacOS X build problem (hopefully) · 0ee0be10
      Simon Marlow authored
      Somebody needs to implement getNumberOfProcessors() for MacOS X;
      until then it will return 1.
  7. 13 Mar, 2009 2 commits
    • Use work-stealing for load-balancing in the GC · 4e354226
      Simon Marlow authored
        
      New flag: "+RTS -qb" disables load-balancing in the parallel GC
      (though this is subject to change; I think we will probably want to do
      something more automatic before releasing this).
      
      To get the "PARGC3" configuration described in the "Runtime support
      for Multicore Haskell" paper, use "+RTS -qg0 -qb -RTS".
      
      The main advantage of this is that it allows us to easily disable
      load-balancing altogether, which turns out to be important in parallel
      programs.  Maintaining locality is sometimes more important than
      spreading the work out in parallel GC.  There is a side benefit in
      that the parallel GC should have improved locality even when
      load-balancing, because each processor prefers to take work from its
      own queue before stealing from others.
    • Instead of a separate context-switch flag, set HpLim to zero · 304e7fb7
      Simon Marlow authored
      This reduces the latency between a context-switch being triggered and
      the thread returning to the scheduler, which in turn should reduce the
      cost of the GC barrier when there are many cores.
      
      We still retain the old context_switch flag which is checked at the
      end of each block of allocation.  The idea is that setting HpLim may
      fail if the target thread is modifying HpLim at the same time; the
      context_switch flag is a fallback.  It also allows us to "context
      switch soon" without forcing an immediate switch, which can be costly.
  8. 06 Mar, 2009 1 commit
    • Partial fix for #2917 · 1b62aece
      Simon Marlow authored
       - add newAlignedPinnedByteArray# for allocating pinned BAs with
         arbitrary alignment
      
       - the old newPinnedByteArray# now aligns to 16 bytes
      
      Foreign.alloca will use newAlignedPinnedByteArray#, and so might end
      up wasting less space than before (we used to align to 8 by default).
      Foreign.allocaBytes and Foreign.mallocForeignPtrBytes will get 16-byte
      aligned memory, which is enough to avoid problems with SSE
      instructions on x86, for example.
      
      There was a bug in the old newPinnedByteArray#: it aligned to 8 bytes,
      but would have failed if the header was not a multiple of 8
      (fortunately it always was, even with profiling).  Also we
      occasionally wasted some space unnecessarily due to alignment in
      allocatePinned().
      
      I haven't done anything about Foreign.malloc/mallocBytes, which will
      give you the same alignment guarantees as malloc() (8 bytes on
      Linux/x86 here).
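
      A minimal sketch of driving the new primop directly (the wrapper
      below and its name allocAligned are illustrative, not part of the
      patch):

        {-# LANGUAGE MagicHash, UnboxedTuples #-}
        module AlignedAlloc where

        import GHC.Exts (Int (I#), byteArrayContents#,
                         newAlignedPinnedByteArray#, unsafeFreezeByteArray#)
        import GHC.IO  (IO (IO))
        import GHC.Ptr (Ptr (Ptr))

        -- Allocate 'size' bytes, pinned and aligned to 'align' bytes, and
        -- return a pointer to the payload.  Pinning makes the address
        -- stable, but the pointer is only valid while the array is alive.
        allocAligned :: Int -> Int -> IO (Ptr a)
        allocAligned (I# size) (I# align) = IO $ \s0 ->
          case newAlignedPinnedByteArray# size align s0 of
            (# s1, mba #) -> case unsafeFreezeByteArray# mba s1 of
              (# s2, ba #) -> (# s2, Ptr (byteArrayContents# ba) #)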
  9. 19 Feb, 2009 1 commit
    • Rewrite of signal-handling (ghc patch; see also base and unix patches) · 7ed3f755
      Simon Marlow authored
      The API is the same (for now).  The new implementation has the
      capability to define signal handlers that have access to the siginfo
      of the signal (#592), but this functionality is not exposed in this
      patch.
      
      #2451 is the ticket for the new API.
      
      The main purpose of bringing this in now is to fix race conditions in
      the old signal handling code (#2858).  Later we can enable the new
      API in the HEAD.
      
      Implementation differences:
      
       - More of the signal-handling is moved into Haskell.  We store the
         table of signal handlers in an MVar, rather than having a table of
         StablePtrs in the RTS.
      
       - In the threaded RTS, the siginfo of the signal is passed down the
         pipe to the IO manager thread, which manages the business of
         starting up new signal handler threads.  In the non-threaded RTS,
         the siginfo of caught signals is stored in the RTS, and the
         scheduler starts new signal handler threads.
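
      Since the API is unchanged, installing a handler still looks like
      this (a minimal sketch; under the threaded RTS the handler thread is
      started by the IO manager, otherwise by the scheduler):

        module Main where

        import Control.Concurrent (threadDelay)
        import System.Posix.Signals (Handler (Catch), installHandler, sigUSR1)

        main :: IO ()
        main = do
          -- the Catch action runs in a fresh Haskell thread per signal
          _ <- installHandler sigUSR1 (Catch (putStrLn "caught SIGUSR1")) Nothing
          threadDelay 10000000   -- wait around; try: kill -USR1 <pid>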
  10. 12 Feb, 2009 1 commit
  11. 11 Feb, 2009 1 commit
  12. 13 Feb, 2009 1 commit
  13. 11 Feb, 2009 1 commit
  14. 06 Feb, 2009 3 commits
  15. 03 Feb, 2009 1 commit
  16. 27 Jan, 2009 1 commit
  17. 26 Jan, 2009 1 commit
  18. 17 Jan, 2009 1 commit
    • Reinstate: Always check the result of pthread_mutex_lock() and pthread_mutex_unlock(). · ee8e3c3f
      Ian Lynagh authored
      Sun Jan  4 19:24:43 GMT 2009  Matthias Kilian <kili@outback.escape.de>
          Don't check pthread_mutex_*lock() only on Linux and/or only if DEBUG
          is defined. The return values of those functions are well defined
          and should be supported on all operating systems with pthreads. The
          checks are cheap enough to do even in the default build (without
          -DDEBUG).
          
          While here, recycle an unused macro ASSERT_LOCK_NOTHELD, and let
          the debugBelch part enabled with -DLOCK_DEBUG work independently
          of -DDEBUG.
  19. 16 Jan, 2009 1 commit
    • UNDO: Always check the result of pthread_mutex_lock() and pthread_mutex_unlock(). · b2d4af35
      Simon Marlow authored
      This patch caused problems on Mac OS X, undoing until we can do it better.
      
      rolling back:
      
      Sun Jan  4 19:24:43 GMT 2009  Matthias Kilian <kili@outback.escape.de>
        * Always check the result of pthread_mutex_lock() and pthread_mutex_unlock().
        
        Don't check pthread_mutex_*lock() only on Linux and/or only if DEBUG
        is defined. The return values of those functions are well defined
        and should be supported on all operating systems with pthreads. The
        checks are cheap enough to do even in the default build (without
        -DDEBUG).
        
        While here, recycle an unused macro ASSERT_LOCK_NOTHELD, and let
        the debugBelch part enabled with -DLOCK_DEBUG work independently
        of -DDEBUG.
        
      
          M ./includes/OSThreads.h -30 +10
  20. 04 Jan, 2009 1 commit
    • Always check the result of pthread_mutex_lock() and pthread_mutex_unlock(). · b1fef4d8
      kili authored
      Don't check pthread_mutex_*lock() only on Linux and/or only if DEBUG
      is defined. The return values of those functions are well defined
      and should be supported on all operating systems with pthreads. The
      checks are cheap enough to do even in the default build (without
      -DDEBUG).
      
      While here, recycle an unused macro ASSERT_LOCK_NOTHELD, and let
      the debugBelch part enabled with -DLOCK_DEBUG work independently
      of -DDEBUG.
  21. 12 Jan, 2009 1 commit
    • Keep the remembered sets local to each thread during parallel GC · 6a405b1e
      Simon Marlow authored
      This turns out to be quite vital for parallel programs:
      
        - The way we discover which threads to traverse is by finding
          dirty threads via the remembered sets (aka mutable lists).
      
        - A dirty thread will be on the remembered set of the capability
          that was running it, and we really want to traverse that thread's
          stack using the GC thread for the capability, because it is in
          that CPU's cache.  If we get this wrong, we get penalised badly by
          the memory system.
      
      Previously we had per-capability mutable lists but they were
      aggregated before GC and traversed by just one of the GC threads.
      This resulted in very poor performance particularly for parallel
      programs with deep stacks.
      
      Now we keep per-capability remembered sets throughout GC, which also
      removes a lock (recordMutableGen_sync).
  22. 10 Dec, 2008 1 commit
  23. 09 Dec, 2008 1 commit
  24. 02 Dec, 2008 1 commit
  25. 14 Aug, 2008 1 commit
    • Merging in the new codegen branch · 176fa33f
      dias@eecs.harvard.edu authored
      This merge does not turn on the new codegen (which only compiles
      a select few programs at this point),
      but it does introduce some changes to the old code generator.
      
      The high bits:
      1. The Rep Swamp patch is finally here.
         The highlight is that the representation of types at the
         machine level has changed.
         Consequently, this patch contains updates across several back ends.
      2. The new Stg -> Cmm path is here, although it appears to have a
         fair number of bugs lurking.
      3. Many improvements along the CmmCPSZ path, including:
         o stack layout
         o some code for infotables, half of which is right and half wrong
         o proc-point splitting
  26. 23 Nov, 2008 1 commit
  27. 21 Nov, 2008 1 commit
    • Use mutator threads to do GC, instead of having a separate pool of GC threads · 3ebcd3de
      Simon Marlow authored
      Previously, the GC had its own pool of threads to use as workers when
      doing parallel GC.  There was a "leader", which was the mutator thread
      that initiated the GC, and the other threads were taken from the pool.
      
      This was simple and worked fine for sequential programs, where we did
      most of the benchmarking for the parallel GC, but falls down for
      parallel programs.  When we have N mutator threads and N cores, at GC
      time we would have to stop N-1 mutator threads and start up N-1 GC
      threads, and hope that the OS schedules them all onto separate cores.
      In practice it doesn't, as you might expect.
      
      Now we use the mutator threads to do GC.  This works quite nicely,
      particularly for parallel programs, where each mutator thread scans
      its own spark pool, which is probably in its cache anyway.
      
      There are some flag changes:
      
        -g<n> is removed (-g1 is still accepted for backwards compat).
        There's no way to have a different number of GC threads than mutator
        threads now.
      
        -q1       Use one OS thread for GC (turns off parallel GC)
        -qg<n>    Use parallel GC for generations >= <n> (default: 1)
      
      Using parallel GC only for generations >=1 works well for sequential
      programs.  Compiling an ordinary sequential program with -threaded and
      running it with -N2 or more should help if you do a lot of GC.  I've
      found that adding -qg0 (do parallel GC for generation 0 too) speeds up
      some parallel programs, but slows down some sequential programs.
      Being conservative, I left the threshold at 1.
      
      ToDo: document the new options.
  28. 18 Nov, 2008 1 commit
    • Add optional eager black-holing, with new flag -feager-blackholing · d600bf7a
      Simon Marlow authored
      Eager blackholing can improve parallel performance by reducing the
      chances that two threads perform the same computation.  However, it
      has a cost: one extra memory write per thunk entry.  
      
      To get the best results, any code which may be executed in parallel
      should be compiled with eager blackholing turned on.  But since
      there's a cost for sequential code, we make it optional and turn it on
      for the parallel package only.  It might be a good idea to compile
      applications (or modules) that contain parallel code with
      -feager-blackholing.
      
      ToDo: document -feager-blackholing.
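
      For instance, a module like the following sketch (mine, for
      illustration) is the kind of thing worth building with
      "ghc -threaded -feager-blackholing", since two threads may try to
      force the same thunks:

        module Main where

        import GHC.Conc (par, pseq)

        fib :: Int -> Int
        fib n
          | n < 2     = n
          | otherwise = x `par` (y `pseq` (x + y))  -- evaluate x alongside y
          where
            x = fib (n - 1)
            y = fib (n - 2)

        main :: IO ()
        main = print (fib 30)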
  29. 17 Nov, 2008 1 commit
    • Attempt to fix #2512 and #2063; add +RTS -xm<address> -RTS option · 11f6f411
      Simon Marlow authored
      On x86_64, the RTS needs to allocate memory in the low 2Gb of the
      address space.  On Linux we can do this with MAP_32BIT, but sometimes
      this doesn't work (#2512) and other OSs don't support it at all
      (#2063).  So to work around this:
      
        - Try MAP_32BIT first, if available.
      
        - Otherwise, try allocating memory from a fixed address (by default
          1Gb).
      
        - We now provide an option to configure the address to allocate
          from.  This allows a workaround on machines where the default
          breaks, and also provides a way for people to test workarounds
          that we can incorporate in future releases.
  30. 14 Nov, 2008 2 commits
  31. 06 Nov, 2008 1 commit
  32. 22 Oct, 2008 1 commit
    • Refactoring and reorganisation of the scheduler · 99df892c
      Simon Marlow authored
      Change the way we look for work in the scheduler.  Previously,
      checking to see whether there was anything to do was a
      non-side-effecting operation, but this has changed now that we do
      work-stealing.  This led to a refactoring of the inner loop of the
      scheduler.
      
      Also, lots of cleanup in the new work-stealing code, but no functional
      changes.
      
      One new statistic is added to the +RTS -s output:
      
        SPARKS: 1430 (2 converted, 1427 pruned)
      
      which lets you know something about the use of `par` in the program.
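
      To see the SPARKS line for your own program, spark some work with
      `par` and run with the stats flag (this example program is
      illustrative, not from the patch):

        module Main where

        import GHC.Conc (par)

        -- build: ghc -threaded Sparky.hs
        -- run:   ./Sparky +RTS -N2 -s
        main :: IO ()
        main = do
          let expensive n = sum [1 .. n * 1000]
              xs = map expensive [1 .. 500 :: Int]
          -- one spark per element; sparks that no thread picks up before
          -- the fold itself evaluates them get discarded (pruned)
          print (foldr (\x acc -> x `par` (x + acc)) 0 xs)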
  33. 15 Sep, 2008 1 commit
    • Work stealing for sparks · cf9650f2
      berthold@mathematik.uni-marburg.de authored
        Spark stealing support for PARALLEL_HASKELL and THREADED_RTS versions of the RTS.

        Spark pools are per capability, separately allocated and held in the
        Capability structure. The implementation uses double-ended queues
        (deques) with CAS-protected access.
        
        The write end of the queue (position bottom) can only be used with
        mutual exclusion, i.e. by exactly one caller at a time.
        Multiple readers can steal()/findSpark() from the read end
        (position top), and are synchronised without a lock, based on a CAS
        of the top position. One reader wins; the others return NULL to
        signal failure.
        
        Work stealing is invoked when a Capability finds no other work (inside
        yieldCapability); it tries all capabilities 0..n-1 twice, stopping
        early if a theft succeeds.
        
        Inside schedulePushWork, all considered capabilities (those which were
        idle and could be grabbed) are woken up. Future versions should wake up
        capabilities immediately when putting a new spark in the local pool,
        from newSpark().
      
       The patch has been re-recorded due to conflicting bugfixes in
       sparks.c, also fixing a (strange) conflict in the scheduler.
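
      A much-simplified model of the reader side described above, with the
      CAS emulated via atomicModifyIORef (SparkPool and tryStealSpark are
      illustrative names only; the real code is C in the RTS):

        module SparkDeque where

        import Data.Array.IO (IOArray, readArray)
        import Data.IORef (IORef, atomicModifyIORef, readIORef)

        data SparkPool a = SparkPool
          { poolTop    :: IORef Int     -- read end: shared, CAS-advanced
          , poolBottom :: IORef Int     -- write end: owner only
          , poolSlots  :: IOArray Int a
          }

        -- Thieves race to advance 'top'; exactly one CAS wins, and the
        -- rest return Nothing, mirroring the NULL-on-failure protocol.
        tryStealSpark :: SparkPool a -> IO (Maybe a)
        tryStealSpark pool = do
          t <- readIORef (poolTop pool)
          b <- readIORef (poolBottom pool)
          if t >= b
            then return Nothing          -- deque looks empty
            else do
              x   <- readArray (poolSlots pool) t
              -- emulate cas(&top, t, t+1): succeed only if top is still t
              won <- atomicModifyIORef (poolTop pool) $ \t' ->
                       if t' == t then (t + 1, True) else (t', False)
              return (if won then Just x else Nothing)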