1. 11 May, 2018 1 commit
  2. 29 Apr, 2017 1 commit
  3. 20 Jul, 2016 1 commit
    • gcampax's avatar
      Compact Regions · cf989ffe
      gcampax authored and Simon Marlow's avatar Simon Marlow committed
      This brings in initial support for compact regions, as described in the
      ICFP 2015 paper "Efficient Communication and Collection with Compact
      Normal Forms" (Edward Z. Yang et.al.) and implemented by Giovanni
      Campagna.
      
      Some things may change before the 8.2 release, but I (Simon M.) wanted
      to get the main patch committed so that we can iterate.
      
      What documentation there is is in the Data.Compact module in the new
      compact package.  We'll need to extend and polish the documentation
      before the release.
      
      Test Plan:
      validate
      (new test cases included)
      
      Reviewers: ezyang, simonmar, hvr, bgamari, austin
      
      Subscribers: vikraman, Yuras, RyanGlScott, qnikst, mboes, facundominguez, rrnewton, thomie, erikd
      
      Differential Revision: https://phabricator.haskell.org/D1264
      
      GHC Trac Issues: #11493
      cf989ffe
  4. 17 Jun, 2016 1 commit
    • Simon Marlow's avatar
      NUMA cleanups · 498ed266
      Simon Marlow authored
      - Move the numaMap and nNumaNodes out of RtsFlags to Capability.c
      - Add a test to tests/rts
      498ed266
  5. 10 Jun, 2016 1 commit
    • Simon Marlow's avatar
      NUMA support · 9e5ea67e
      Simon Marlow authored
      Summary:
      The aim here is to reduce the number of remote memory accesses on
      systems with a NUMA memory architecture, typically multi-socket servers.
      
      Linux provides a NUMA API for doing two things:
      * Allocating memory local to a particular node
      * Binding a thread to a particular node
      
      When given the +RTS --numa flag, the runtime will
      * Determine the number of NUMA nodes (N) by querying the OS
      * Assign capabilities to nodes, so cap C is on node C%N
      * Bind worker threads on a capability to the correct node
      * Keep a separate free lists in the block layer for each node
      * Allocate the nursery for a capability from node-local memory
      * Allocate blocks in the GC from node-local memory
      
      For example, using nofib/parallel/queens on a 24-core 2-socket machine:
      
      ```
      $ ./Main 15 +RTS -N24 -s -A64m
        Total   time  173.960s  (  7.467s elapsed)
      
      $ ./Main 15 +RTS -N24 -s -A64m --numa
        Total   time  150.836s  (  6.423s elapsed)
      ```
      
      The biggest win here is expected to be allocating from node-local
      memory, so that means programs using a large -A value (as here).
      
      According to perf, on this program the number of remote memory accesses
      were reduced by more than 50% by using `--numa`.
      
      Test Plan:
      * validate
      * There's a new flag --debug-numa=<n> that pretends to do NUMA without
        actually making the OS calls, which is useful for testing the code
        on non-NUMA systems.
      * TODO: I need to add some unit tests
      
      Reviewers: erikd, austin, rwbarton, ezyang, bgamari, hvr, niteria
      
      Subscribers: thomie
      
      Differential Revision: https://phabricator.haskell.org/D2199
      9e5ea67e
  6. 04 May, 2016 1 commit
  7. 02 May, 2016 2 commits
    • Simon Marlow's avatar
      Cleanups related to MAX_FREE_LIST · ef446066
      Simon Marlow authored
      - Rename to the (more correct) NUM_FREE_LISTS
      
      - NUM_FREE_LISTS should be derived from the block and mblock sizes, not
        defined manually.  It was actually too large by one, which caused a
        little bit of (benign) extra work in the form of a redundant loop
        iteration in some cases.
      
      - Add some ASSERTs for input preconditions to log_2() and log_2_ceil()
      
      - Fix some comments
      
      - Fix usage in allocLargeChunk, to account for the fact that
        log_2_ceil() can return NUM_FREE_LISTS.
      ef446066
    • U-THEFACEBOOK\smarlow's avatar
      Revert "Revert "Use __builtin_clz() to implement log_1()"" · 5f8c0b84
      U-THEFACEBOOK\smarlow authored and Simon Marlow's avatar Simon Marlow committed
      This reverts commit 546f24e4.
      
      And adds a fix for Windows: we need to use __builtin_clzll() rather than
      __builtin_clzl(), because StgWord is unsigned long long on Windows.
      5f8c0b84
  8. 28 Apr, 2016 1 commit
  9. 26 Apr, 2016 1 commit
  10. 07 Dec, 2015 1 commit
  11. 22 Jul, 2015 1 commit
    • gcampax's avatar
      Two step allocator for 64-bit systems · 0d1a8d09
      gcampax authored and Simon Marlow's avatar Simon Marlow committed
      Summary:
      The current OS memory allocator conflates the concepts of allocating
      address space and allocating memory, which makes the HEAP_ALLOCED()
      implementation excessively complicated (as the only thing it cares
      about is address space layout) and slow. Instead, what we want
      is to allocate a single insanely large contiguous block of address
      space (to make HEAP_ALLOCED() checks fast), and then commit subportions
      of that in 1MB blocks as we did before.
      This is currently behind a flag, USE_LARGE_ADDRESS_SPACE, that is only enabled for
      certain OSes.
      
      Test Plan: validate
      
      Reviewers: simonmar, ezyang, austin
      
      Subscribers: thomie, carter
      
      Differential Revision: https://phabricator.haskell.org/D524
      
      GHC Trac Issues: #9706
      0d1a8d09
  12. 15 Jul, 2015 1 commit
  13. 29 Sep, 2014 1 commit
  14. 20 Aug, 2014 1 commit
  15. 28 Jul, 2014 1 commit
  16. 13 Apr, 2014 1 commit
  17. 31 Dec, 2013 1 commit
  18. 25 Oct, 2013 1 commit
  19. 21 Sep, 2012 1 commit
    • Simon Marlow's avatar
      Allow allocNursery() to allocate single blocks (#7257) · 1f5d8364
      Simon Marlow authored
      Forcing large allocations here can creates serious fragmentation in
      some cases, and since the large allocations are only a small
      optimisation we should allow the nursery to hoover up small blocks
      before allocating large chunks.
      1f5d8364
  20. 07 Sep, 2012 3 commits
  21. 21 Aug, 2012 2 commits
  22. 25 Jan, 2011 1 commit
  23. 15 Dec, 2010 1 commit
    • Simon Marlow's avatar
      Implement stack chunks and separate TSO/STACK objects · f30d5273
      Simon Marlow authored
      This patch makes two changes to the way stacks are managed:
      
      1. The stack is now stored in a separate object from the TSO.
      
      This means that it is easier to replace the stack object for a thread
      when the stack overflows or underflows; we don't have to leave behind
      the old TSO as an indirection any more.  Consequently, we can remove
      ThreadRelocated and deRefTSO(), which were a pain.
      
      This is obviously the right thing, but the last time I tried to do it
      it made performance worse.  This time I seem to have cracked it.
      
      2. Stacks are now represented as a chain of chunks, rather than
         a single monolithic object.
      
      The big advantage here is that individual chunks are marked clean or
      dirty according to whether they contain pointers to the young
      generation, and the GC can avoid traversing clean stack chunks during
      a young-generation collection.  This means that programs with deep
      stacks will see a big saving in GC overhead when using the default GC
      settings.
      
      A secondary advantage is that there is much less copying involved as
      the stack grows.  Programs that quickly grow a deep stack will see big
      improvements.
      
      In some ways the implementation is simpler, as nothing special needs
      to be done to reclaim stack as the stack shrinks (the GC just recovers
      the dead stack chunks).  On the other hand, we have to manage stack
      underflow between chunks, so there's a new stack frame
      (UNDERFLOW_FRAME), and we now have separate TSO and STACK objects.
      The total amount of code is probably about the same as before.
      
      There are new RTS flags:
      
         -ki<size> Sets the initial thread stack size (default 1k)  Egs: -ki4k -ki2m
         -kc<size> Sets the stack chunk size (default 32k)
         -kb<size> Sets the stack chunk buffer size (default 1k)
      
      -ki was previously called just -k, and the old name is still accepted
      for backwards compatibility.  These new options are documented.
      f30d5273
  24. 09 Dec, 2010 1 commit
  25. 29 Oct, 2010 1 commit
  26. 01 Nov, 2010 1 commit
  27. 13 Aug, 2010 1 commit
  28. 24 Jun, 2010 3 commits
  29. 26 May, 2010 1 commit
  30. 03 Dec, 2009 1 commit
    • Simon Marlow's avatar
      GC refactoring, remove "steps" · 214b3663
      Simon Marlow authored
      The GC had a two-level structure, G generations each of T steps.
      Steps are for aging within a generation, mostly to avoid premature
      promotion.  
      
      Measurements show that more than 2 steps is almost never worthwhile,
      and 1 step is usually worse than 2.  In theory fractional steps are
      possible, so the ideal number of steps is somewhere between 1 and 3.
      GHC's default has always been 2.
      
      We can implement 2 steps quite straightforwardly by having each block
      point to the generation to which objects in that block should be
      promoted, so blocks in the nursery point to generation 0, and blocks
      in gen 0 point to gen 1, and so on.
      
      This commit removes the explicit step structures, merging generations
      with steps, thus simplifying a lot of code.  Performance is
      unaffected.  The tunable number of steps is now gone, although it may
      be replaced in the future by a way to tune the aging in generation 0.
      214b3663
  31. 02 Dec, 2009 1 commit
  32. 29 Nov, 2009 1 commit
    • Simon Marlow's avatar
      Store a destination step in the block descriptor · f9d15f9f
      Simon Marlow authored
      At the moment, this just saves a memory reference in the GC inner loop
      (worth a percent or two of GC time).  Later, it will hopefully let me
      experiment with partial steps, and simplifying the generation/step
      infrastructure.
      f9d15f9f
  33. 02 Aug, 2009 1 commit
    • Simon Marlow's avatar
      RTS tidyup sweep, first phase · a2a67cd5
      Simon Marlow authored
      The first phase of this tidyup is focussed on the header files, and in
      particular making sure we are exposinng publicly exactly what we need
      to, and no more.
      
       - Rts.h now includes everything that the RTS exposes publicly,
         rather than a random subset of it.
      
       - Most of the public header files have moved into subdirectories, and
         many of them have been renamed.  But clients should not need to
         include any of the other headers directly, just #include the main
         public headers: Rts.h, HsFFI.h, RtsAPI.h.
      
       - All the headers needed for via-C compilation have moved into the
         stg subdirectory, which is self-contained.  Most of the headers for
         the rest of the RTS APIs have moved into the rts subdirectory.
      
       - I left MachDeps.h where it is, because it is so widely used in
         Haskell code.
       
       - I left a deprecated stub for RtsFlags.h in place.  The flag
         structures are now exposed by Rts.h.
      
       - Various internal APIs are no longer exposed by public header files.
      
       - Various bits of dead code and declarations have been removed
      
       - More gcc warnings are turned on, and the RTS code is more
         warning-clean.
      
       - More source files #include "PosixSource.h", and hence only use
         standard POSIX (1003.1c-1995) interfaces.
      
      There is a lot more tidying up still to do, this is just the first
      pass.  I also intend to standardise the names for external RTS APIs
      (e.g use the rts_ prefix consistently), and declare the internal APIs
      as hidden for shared libraries.
      a2a67cd5
  34. 12 Sep, 2008 1 commit