  1. 04 Mar, 2020 1 commit
  2. 25 Jan, 2020 1 commit
  3. 08 Jun, 2017 1 commit
    • Fix a lost-wakeup bug in BLACKHOLE handling (#13751) · 59847290
      Simon Marlow authored
      Summary:
      The problem occurred when
      * Threads A & B evaluate the same thunk
      * Thread A context-switches, so the thunk gets blackholed
      * Thread C enters the blackhole, creates a BLOCKING_QUEUE attached to
        the blackhole and thread A's `tso->bq` queue
      * Thread B updates the blackhole with a value, overwriting the BLOCKING_QUEUE
      * We GC, replacing A's update frame with stg_enter_checkbh
      * Throw an exception in A, which ignores the stg_enter_checkbh frame
      
      Now we have C blocked on A's tso->bq queue, but we forgot to check the
      queue because the stg_enter_checkbh frame has been thrown away by the
      exception.
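
      For illustration, a minimal Haskell sketch of this shape of interleaving
      (hypothetical; compiled with -threaded and run with +RTS -N, it does not
      deterministically reproduce the race, the real test lives in the
      testsuite):

      ```
      import Control.Concurrent
      import Control.Exception

      main :: IO ()
      main = do
        let thunk = sum [1 .. 10000000] :: Integer  -- shared thunk; blackholed once its evaluator pauses
        a <- forkIO (evaluate thunk >> return ())   -- thread A starts evaluating
        _ <- forkIO (evaluate thunk >> return ())   -- threads B and C enter the same thunk/blackhole
        _ <- forkIO (evaluate thunk >> return ())
        threadDelay 1000
        throwTo a ThreadKilled                      -- asynchronous exception into A
        print thunk                                 -- the blocked threads must still be woken up
      ```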
      
      The solution and alternative designs are discussed in Note [upd-black-hole].
      
      This also exposed a bug in the interpreter, whereby we were sometimes
      context-switching without calling `threadPaused()`.  I've fixed this
      and added some Notes.
      
      Test Plan:
      * `cd testsuite/tests/concurrent && make slow`
      * validate
      
      Reviewers: niteria, bgamari, austin, erikd
      
      Reviewed By: erikd
      
      Subscribers: rwbarton, thomie
      
      GHC Trac Issues: #13751
      
      Differential Revision: https://phabricator.haskell.org/D3630
      59847290
  4. 29 Apr, 2017 1 commit
  5. 04 Feb, 2017 1 commit
    • Fix comment (old file names) in rts/ · 31bb85ff
      Takenobu Tani authored
      [skip ci]
      
      There were some old file names (.lhs, ...) in comments.
      
      * rts/win32/ThrIOManager.c
        - Conc.lhs -> Conc.hs
      
      * rts/PrimOps.cmm
        - ByteCodeLink.lhs -> ByteCodeLink.hs
        - StgMiscClosures.hc -> StgMiscClosures.cmm
      
      * rts/AutoApply.h
        - AutoApply.hc -> AutoApply.cmm
      
      * rts/HeapStackCheck.cmm
        - PrimOps.hc -> PrimOps.cmm
      
      * rts/LdvProfile.h
        - Updates.hc -> Updates.cmm
      
      * rts/Schedule.c
        - StgStartup.hc -> StgStartup.cmm
      
      * rts/Weak.c
        - StgMiscClosures.hc -> StgMiscClosures.cmm
      
      Reviewers: bgamari, austin, erikd, simonmar
      
      Subscribers: thomie
      
      Differential Revision: https://phabricator.haskell.org/D3075
      31bb85ff
  6. 10 Jun, 2016 1 commit
    • NUMA support · 9e5ea67e
      Simon Marlow authored
      Summary:
      The aim here is to reduce the number of remote memory accesses on
      systems with a NUMA memory architecture, typically multi-socket servers.
      
      Linux provides a NUMA API for doing two things:
      * Allocating memory local to a particular node
      * Binding a thread to a particular node
      
      When given the +RTS --numa flag, the runtime will
      * Determine the number of NUMA nodes (N) by querying the OS
      * Assign capabilities to nodes, so cap C is on node C%N
      * Bind worker threads on a capability to the correct node
      * Keep a separate free list in the block layer for each node
      * Allocate the nursery for a capability from node-local memory
      * Allocate blocks in the GC from node-local memory
      
      For example, using nofib/parallel/queens on a 24-core 2-socket machine:
      
      ```
      $ ./Main 15 +RTS -N24 -s -A64m
        Total   time  173.960s  (  7.467s elapsed)
      
      $ ./Main 15 +RTS -N24 -s -A64m --numa
        Total   time  150.836s  (  6.423s elapsed)
      ```
      
      The biggest win here is expected to come from allocating in node-local
      memory, which matters most for programs using a large -A value (as here).
      
      According to perf, on this program the number of remote memory accesses
      was reduced by more than 50% by using `--numa`.
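
      A tiny, hypothetical Haskell model of the capability-to-node assignment
      described above (capability C goes to node C mod N); this is only a sketch
      of the arithmetic, not the runtime's actual code:

      ```
      -- With nNodes NUMA nodes, capability cap lives on node (cap `mod` nNodes).
      capToNode :: Int -> Int -> Int
      capToNode nNodes cap = cap `mod` nNodes

      main :: IO ()
      main = mapM_ (\c -> putStrLn ("cap " ++ show c ++ " -> node " ++ show (capToNode 2 c)))
                   [0 .. 23]  -- e.g. the 24-capability, 2-socket machine above
      ```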
      
      Test Plan:
      * validate
      * There's a new flag --debug-numa=<n> that pretends to do NUMA without
        actually making the OS calls, which is useful for testing the code
        on non-NUMA systems.
      * TODO: I need to add some unit tests
      
      Reviewers: erikd, austin, rwbarton, ezyang, bgamari, hvr, niteria
      
      Subscribers: thomie
      
      Differential Revision: https://phabricator.haskell.org/D2199
      9e5ea67e
  7. 08 Sep, 2015 1 commit
  8. 06 Jul, 2015 1 commit
  9. 25 Nov, 2014 1 commit
    • Make clearNursery free · e22bc0de
      Simon Marlow authored
      Summary:
      clearNursery resets all the bd->free pointers of nursery blocks to
      make the blocks empty.  In profiles we've seen clearNursery taking
      significant amounts of time particularly with large -N and -A values.
      
      This patch moves the work of clearNursery to the point at which we
      actually need the new block, thereby introducing an invariant that
      blocks to the right of the CurrentNursery pointer still need their
      bd->free pointer reset.  This should make things faster overall,
      because we don't need to clear blocks that we don't use.
      
      Test Plan: validate
      
      Reviewers: AndreasVoellmy, ezyang, austin
      
      Subscribers: thomie, carter, ezyang, simonmar
      
      Differential Revision: https://phabricator.haskell.org/D318
      e22bc0de
  10. 12 Nov, 2014 1 commit
  11. 21 Oct, 2014 1 commit
  12. 01 Aug, 2014 1 commit
    • interruptible() was not returning true for BlockedOnSTM (#9379) · 9d9a5546
      Simon Marlow authored
      Summary:
      There's a knock-on fix in HeapStackCheck.c which is potentially
      scary, but I'm pretty confident it is OK.  See the comment for details.
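
      A hedged illustration of the user-visible behaviour at stake: a thread
      blocked on an STM retry must remain interruptible by asynchronous
      exceptions (here delivered via System.Timeout; assumes the stm package):

      ```
      import Control.Concurrent.STM
      import System.Timeout (timeout)

      main :: IO ()
      main = do
        tv <- newTVarIO False
        -- The transaction retries forever, so the thread blocks with
        -- status BlockedOnSTM until the timeout's exception arrives.
        r <- timeout 100000 (atomically (readTVar tv >>= check))
        print r  -- Nothing: the blocked transaction was interrupted
      ```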
      
      Test Plan:
      I've run all the STM
      tests I can find, including libraries/stm/tests/stm049 with +RTS -N8
      and some of the constants bumped to make it more of a stress test.
      
      Reviewers: hvr, rwbarton, austin
      
      Subscribers: simonmar, relrod, ezyang, carter
      
      Differential Revision: https://phabricator.haskell.org/D104
      
      GHC Trac Issues: #9379
      9d9a5546
  13. 04 May, 2014 1 commit
  14. 02 May, 2014 1 commit
    • Per-thread allocation counters and limits · b0534f78
      Simon Marlow authored
      This tracks the amount of memory allocation by each thread in a
      counter stored in the TSO.  Optionally, when the counter drops below
      zero (it counts down), the thread can be sent an asynchronous
      exception: AllocationLimitExceeded.  When this happens, the thread is
      given a small additional limit so that it can handle the exception.
      See the documentation in GHC.Conc for more details.
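
      A hedged usage sketch, written against the allocation-counter interface
      as it appears in later versions of GHC.Conc (the exact entry points at
      the time of this patch may differ):

      ```
      import Control.Concurrent
      import Control.Exception
      import GHC.Conc (setAllocationCounter, enableAllocationLimit)

      main :: IO ()
      main = do
        done <- newEmptyMVar
        _ <- forkIO $ do
               setAllocationCounter (1024 * 1024)  -- ~1MB allocation budget for this thread
               enableAllocationLimit               -- counter underflow raises AllocationLimitExceeded
               r <- try (evaluate (sum [1 .. 10000000 :: Integer]))
               putMVar done $ case r of
                 Left AllocationLimitExceeded -> "allocation limit hit"
                 Right n                      -> "finished: " ++ show n
        putStrLn =<< takeMVar done
      ```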
      
      Allocation limits are similar to timeouts, but
      
        - timeouts use real time, not CPU time.  Allocation limits do not
          count anything while the thread is blocked or in foreign code.
      
        - timeouts don't re-trigger if the thread catches the exception,
          allocation limits do.
      
        - timeouts can catch non-allocating loops, if you use
          -fno-omit-yields.  This doesn't work for allocation limits.
      
      I couldn't measure any impact on benchmarks with these changes, even
      for nofib/smp.
      b0534f78
  15. 29 Apr, 2014 1 commit
    • Fix scavenge_stack crash (#9045) · ab8bb489
      Simon Marlow authored
      The new stg_gc_prim_p_ll stack frame was missing an info table.  This
      is a regression since 7.6, because this stuff was part of a cleanup
      that happened in 7.7.
      ab8bb489
  16. 01 Oct, 2013 1 commit
    • Remove use of R9, and fix associated bugs · 11b5ce55
      Simon Marlow authored
      We were passing the function address to stg_gc_prim_p in R9, which was
      wrong because the call was a high-level call and didn't declare R9 as
      a parameter.  Passing R9 as an argument is the right way, but
      unfortunately that exposed another bug: we were using the same macro
      in some low-level Cmm, where it is illegal to call functions with
      arguments (see Note [syntax of cmm files]).  So we now have low-level
      variants of STK_CHK() and STK_CHK_P() for use in low-level Cmm code.
      11b5ce55
  17. 13 Jul, 2013 1 commit
  18. 09 Jul, 2013 1 commit
  19. 01 Nov, 2012 1 commit
  20. 30 Oct, 2012 1 commit
  21. 19 Oct, 2012 1 commit
  22. 09 Oct, 2012 1 commit
  23. 08 Oct, 2012 1 commit
    • Produce new-style Cmm from the Cmm parser · a7c0387d
      Simon Marlow authored
      The main change here is that the Cmm parser now allows high-level cmm
      code with argument-passing and function calls.  For example:
      
      foo ( gcptr a, bits32 b )
      {
        if (b > 0) {
           // we can make tail calls passing arguments:
           jump stg_ap_0_fast(a);
        }
      
        return (x,y);
      }
      
      More details on the new cmm syntax are in Note [Syntax of .cmm files]
      in CmmParse.y.
      
      The old syntax is still more-or-less supported for those occasional
      code fragments that really need to explicitly manipulate the stack.
      However there are a couple of differences: it is now obligatory to
      give a list of live GlobalRegs on every jump, e.g.
      
        jump %ENTRY_CODE(Sp(0)) [R1];
      
      Again, more details in Note [Syntax of .cmm files].
      
      I have rewritten most of the .cmm files in the RTS into the new
      syntax, except for AutoApply.cmm which is generated by the genapply
      program: this file could be generated in the new syntax instead and
      would probably be better off for it, but I ran out of enthusiasm.
      
      Some other changes in this batch:
      
       - The PrimOp calling convention is gone, primops now use the ordinary
         NativeNodeCall convention.  This means that primops and "foreign
         import prim" code must be written in high-level cmm, but they can
         now take more than 10 arguments.
      
       - CmmSink now does constant-folding (should fix #7219)
      
       - .cmm files now go through the cmmPipeline, and as a result we
         generate better code in many cases.  All the object files generated
         for the RTS .cmm files are now smaller.  Performance should be
         better too, but I haven't measured it yet.
      
       - RET_DYN frames are removed from the RTS, lots of code goes away
      
       - we now have some more canned GC points to cover unboxed-tuples with
         2-4 pointers, which will reduce code size a little.
      a7c0387d
  24. 20 Mar, 2012 2 commits
  25. 15 Mar, 2012 2 commits
  26. 27 Feb, 2012 1 commit
  27. 01 Dec, 2011 1 commit
    • Fix a scheduling bug in the threaded RTS · 6d18141d
      Simon Marlow authored
      The parallel GC was using setContextSwitches() to stop all the other
      threads, which sets the context_switch flag on every Capability.  That
      had the side effect of causing every Capability to also switch
      threads, and since GCs can be much more frequent than context
      switches, this increased the context switch frequency.  When context
      switches are expensive (because the switch is between two bound
      threads or a bound and unbound thread), the difference is quite
      noticeable.
      
      The fix is to have a separate flag to indicate that a Capability
      should stop and return to the scheduler, but not switch threads.  I've
      called this the "interrupt" flag.
      6d18141d
  28. 29 Nov, 2011 2 commits
  29. 20 Apr, 2010 1 commit
    • Fix crash in non-threaded RTS on Windows · 870b59f5
      Simon Marlow authored
      The tso->block_info field is now overwritten by pushOnRunQueue(), but
      stg_block_async_info was assuming that it still held a pointer to the
      StgAsyncIOResult.  We must therefore save this value somewhere safe
      before putting the TSO on the run queue.
      870b59f5
  30. 01 Apr, 2010 1 commit
    • Change the representation of the MVar blocked queue · f4692220
      Simon Marlow authored
      The list of threads blocked on an MVar is now represented as a list of
      separately allocated objects rather than being linked through the TSOs
      themselves.  This lets us remove a TSO from the list in O(1) time
      rather than O(n) time, by marking the list object.  Removing this
      linear component fixes some pathological performance cases where many
      threads were blocked on an MVar and became unreachable simultaneously
      (nofib/smp/threads007), or when sending an asynchronous exception to a
      TSO in a long list of threads blocked on an MVar.
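
      A hedged sketch of the pathological pattern described above: many threads
      blocked on one MVar, with an asynchronous exception sent to a thread deep
      in the queue (previously an O(n) removal, now O(1)):

      ```
      import Control.Concurrent
      import Control.Monad (forM, forM_)

      main :: IO ()
      main = do
        mv   <- newEmptyMVar :: IO (MVar ())
        tids <- forM [1 .. 10000 :: Int] (\_ -> forkIO (takeMVar mv))
        threadDelay 100000          -- let all the threads block on the MVar
        killThread (tids !! 5000)   -- remove one thread from the middle of the queue
        forM_ tids killThread       -- clean up the rest
      ```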
      
      MVar performance has actually improved by a few percent as a result of
      this change, slightly to my surprise.
      
      This is the final cleanup in the sequence, which let me remove the old
      way of waking up threads (unblockOne(), MSG_WAKEUP) in favour of the
      new way (tryWakeupThread and MSG_TRY_WAKEUP, which is idempotent).  It
      is now the case that only the Capability that owns a TSO may modify
      its state (well, almost), and this simplifies various things.  More of
      the RTS is based on message-passing between Capabilities now.
      f4692220
  31. 29 Mar, 2010 1 commit
    • New implementation of BLACKHOLEs · 5d52d9b6
      Simon Marlow authored
      This replaces the global blackhole_queue with a clever scheme that
      enables us to queue up blocked threads on the closure that they are
      blocked on, while still avoiding atomic instructions in the common
      case.
      
      Advantages:
      
       - gets rid of a locked global data structure and some tricky GC code
         (replacing it with some per-thread data structures and different
         tricky GC code :)
      
       - wakeups are more prompt: parallel/concurrent performance should
         benefit.  I haven't seen anything dramatic in the parallel
         benchmarks so far, but a couple of threading benchmarks do improve
         a bit.
      
       - waking up a thread blocked on a blackhole is now O(1) (e.g. if
         it is the target of throwTo).
      
       - less sharing and better separation of Capabilities: communication
         is done with messages, the data structures are strictly owned by a
         Capability and cannot be modified except by sending messages.
      
       - this change will ultimately enable us to do more intelligent
         scheduling when threads block on each other.  This is what started
         off the whole thing, but it isn't done yet (#3838).
      
      I'll be documenting all this on the wiki in due course.
      5d52d9b6
  32. 25 Mar, 2010 1 commit
  33. 11 Mar, 2010 2 commits
    • Use message-passing to implement throwTo in the RTS · 7408b392
      Simon Marlow authored
      This replaces some complicated locking schemes with message-passing
      in the implementation of throwTo. The benefits are
      
       - previously it was impossible to guarantee that a throwTo from
         a thread running on one CPU to a thread running on another CPU
         would be noticed, and we had to rely on the GC to pick up these
         forgotten exceptions. This no longer happens.
      
       - the locking regime is simpler (though the code is about the same
         size)
      
       - threads can be unblocked from a blocked_exceptions queue without
         having to traverse the whole queue now.  It's a rare case, but
         replaces an O(n) operation with an O(1) one.
      
       - generally we move in the direction of sharing less between
         Capabilities (aka HECs), which will become important with other
         changes we have planned.
      
      Also in this patch I replaced several STM-specific closure types with
      a generic MUT_PRIM closure type, which allowed a lot of code in the GC
      and other places to go away, hence the line-count reduction.  The
      message-passing changes resulted in about a net zero line-count
      difference.
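
      A hedged illustration of the first point above: a throwTo from a thread on
      one capability to a thread running on another should be delivered promptly
      (hypothetical driver; compile with -threaded, run with +RTS -N2):

      ```
      import Control.Concurrent
      import Control.Exception

      main :: IO ()
      main = do
        t <- forkOn 1 (mapM_ evaluate ([1 ..] :: [Integer]))  -- allocating loop pinned to capability 1
        threadDelay 100000
        throwTo t ThreadKilled    -- sent from the main thread, on another capability
        putStrLn "exception delivered"
      ```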
      7408b392
  34. 03 Aug, 2009 1 commit
  35. 02 Jun, 2009 1 commit
  36. 18 Mar, 2009 1 commit