  1. 19 Oct, 2019 2 commits
  2. 18 Oct, 2019 2 commits
    • rts: Non-concurrent mark and sweep · 2309789a
      Ömer Sinan Ağacan authored
      This implements the core heap structure and a serial mark/sweep
      collector which can be used to manage the oldest-generation heap.
      This is the first step towards a concurrent mark-and-sweep collector
      aimed at low-latency applications.
      
      The full design of the collector implemented here is described in detail
      in a technical note
      
          B. Gamari. "A Concurrent Garbage Collector For the Glasgow Haskell
          Compiler" (2018)
      
      The basic heap structure used in this design is heavily inspired by
      
          K. Ueno & A. Ohori. "A fully concurrent garbage collector for
          functional programs on multicore processors." /ACM SIGPLAN Notices/
          Vol. 51, No. 9 (presented at ICFP 2016)
      
      This design is intended to allow both marking and sweeping to run
      concurrently with the execution of a multi-core mutator. Unlike the Ueno design,
      which requires no global synchronization pauses, the collector
      introduced here requires a stop-the-world pause at the beginning and end
      of the mark phase.
      
      To avoid heap fragmentation, the allocator consists of a number of
      fixed-size /sub-allocators/. Each of these sub-allocators allocates into
      its own set of /segments/, themselves allocated from the block
      allocator. Each segment is broken into a set of fixed-size allocation
      blocks (which back allocations) in addition to a bitmap (used to track
      the liveness of blocks) and some additional metadata (also used to
      track liveness).
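
      As a rough illustration of the shape of this structure (the names and
      sizes below are illustrative stand-ins, not the RTS's actual
      definitions), a segment might look something like this:

      ```
      #include <stdint.h>
      #include <stddef.h>

      #define SEG_BLOCK_COUNT 1024          /* illustrative segment capacity */

      struct segment {
          uint16_t block_size;              /* fixed allocation size for this segment */
          uint16_t next_free;               /* where to start searching for a free block */
          uint8_t  bitmap[SEG_BLOCK_COUNT]; /* liveness/allocated flag, one per block */
          uint8_t  blocks[];                /* the fixed-size blocks backing allocations */
      };

      /* Allocate one block from a segment by scanning the bitmap;
       * returns NULL when the segment is full. */
      static void *segment_alloc(struct segment *seg) {
          for (uint16_t i = seg->next_free; i < SEG_BLOCK_COUNT; i++) {
              if (seg->bitmap[i] == 0) {
                  seg->bitmap[i] = 1;
                  seg->next_free = i + 1;
                  return seg->blocks + (size_t)i * seg->block_size;
              }
          }
          return NULL;
      }
      ```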
      
      This heap structure enables collection via mark-and-sweep, which can be
      performed concurrently via a snapshot-at-the-beginning scheme (although
      concurrent collection is not implemented in this patch).
      
      The mark queue is a fairly straightforward chunked-array structure.
      The representation is a bit more verbose than a typical mark queue to
      accommodate a combination of two features (a rough sketch follows the list):
      
       * a mark FIFO, which improves the locality of marking, reducing one of
         the major overheads seen in mark/sweep allocators (see [1] for
         details)
      
       * the selector optimization and indirection shortcutting, which
         requires that we track where we found each reference to an object
         in case we need to update the reference at a later point (e.g. when
         we find that it is an indirection). See Note [Origin references in
         the nonmoving collector] (in `NonMovingMark.h`) for details.
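
      A loose sketch of the kind of representation being described (a
      chunked array whose entries also record where each reference was
      found); all names here are invented for illustration:

      ```
      #include <stddef.h>
      #include <stdlib.h>

      typedef struct closure closure;     /* stand-in for a heap object */

      typedef struct {
          closure  *obj;                  /* the object to mark */
          closure **origin;               /* where the reference was found, so it
                                           * can be updated later (e.g. when obj
                                           * turns out to be an indirection) */
      } mark_entry;

      #define CHUNK_CAPACITY 256

      typedef struct mark_chunk {
          struct mark_chunk *next;        /* previously filled chunks */
          size_t             used;
          mark_entry         entries[CHUNK_CAPACITY];
      } mark_chunk;

      /* Push an entry, starting a fresh chunk when the current one is full.
       * (The prefetching FIFO mentioned above would sit in front of this.) */
      static void mark_push(mark_chunk **top, closure *obj, closure **origin) {
          if (*top == NULL || (*top)->used == CHUNK_CAPACITY) {
              mark_chunk *c = calloc(1, sizeof *c);
              if (c == NULL) abort();
              c->next = *top;
              *top = c;
          }
          (*top)->entries[(*top)->used++] = (mark_entry){ obj, origin };
      }
      ```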
      
      Beyond this the mark/sweep is fairly run-of-the-mill.
      
      [1] R. Garner, S.M. Blackburn, D. Frampton. "Effective Prefetch for
          Mark-Sweep Garbage Collection." ISMM 2007.
      Co-Authored-By: Ben Gamari <ben@well-typed.com>
    • rts/BlockAlloc: Allow aligned allocation requests · 921e4e36
      Ömer Sinan Ağacan authored
      This implements support for block group allocations which are aligned to
      an integral number of blocks.
      
      This will be used by the nonmoving garbage collector, which uses the
      block allocator to allocate the segments which back its heap. These
      segments are a fixed number of blocks in size, with each segment being
      aligned to the segment size boundary. This allows us to easily find the
      segment metadata stored at the beginning of the segment.
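
      The round-down trick can be shown in isolation like this (the segment
      size and struct below are stand-ins, not the RTS's definitions):

      ```
      #include <stdint.h>

      /* Segments are SEGMENT_SIZE bytes and aligned to SEGMENT_SIZE, so the
       * metadata at the start of a segment can be recovered from any interior
       * pointer by masking off the low bits. */
      #define SEGMENT_SIZE ((uintptr_t)1 << 15)   /* illustrative: 32 KiB */

      struct segment_info { uint16_t block_size; /* ... */ };

      static struct segment_info *segment_of(void *p) {
          return (struct segment_info *)((uintptr_t)p & ~(SEGMENT_SIZE - 1));
      }
      ```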
  3. 10 May, 2018 1 commit
  4. 05 Apr, 2018 1 commit
  5. 29 Apr, 2017 1 commit
  6. 23 Apr, 2017 1 commit
  7. 20 Jul, 2016 1 commit
    • Compact Regions · cf989ffe
      gcampax authored
      This brings in initial support for compact regions, as described in the
      ICFP 2015 paper "Efficient Communication and Collection with Compact
      Normal Forms" (Edward Z. Yang et.al.) and implemented by Giovanni
      Campagna.
      
      Some things may change before the 8.2 release, but I (Simon M.) wanted
      to get the main patch committed so that we can iterate.
      
      What documentation there is lives in the Data.Compact module in the new
      compact package.  We'll need to extend and polish the documentation
      before the release.
      
      Test Plan:
      validate
      (new test cases included)
      
      Reviewers: ezyang, simonmar, hvr, bgamari, austin
      
      Subscribers: vikraman, Yuras, RyanGlScott, qnikst, mboes, facundominguez, rrnewton, thomie, erikd
      
      Differential Revision: https://phabricator.haskell.org/D1264
      
      GHC Trac Issues: #11493
  8. 10 Jun, 2016 1 commit
    • NUMA support · 9e5ea67e
      Simon Marlow authored
      Summary:
      The aim here is to reduce the number of remote memory accesses on
      systems with a NUMA memory architecture, typically multi-socket servers.
      
      Linux provides a NUMA API for doing two things:
      * Allocating memory local to a particular node
      * Binding a thread to a particular node
      
      When given the +RTS --numa flag, the runtime will
      * Determine the number of NUMA nodes (N) by querying the OS
       * Assign capabilities to nodes, so cap C is on node C%N (a sketch of
         this mapping follows the list)
      * Bind worker threads on a capability to the correct node
       * Keep separate free lists in the block layer, one for each node
      * Allocate the nursery for a capability from node-local memory
      * Allocate blocks in the GC from node-local memory
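
      A standalone sketch of the capability-to-node mapping and node-local
      allocation, written against the Linux libnuma API rather than the RTS's
      own OS layer (the libnuma calls are real; the surrounding names and
      structure are illustrative only):

      ```
      /* Link with -lnuma on Linux. */
      #include <numa.h>
      #include <stdio.h>
      #include <stdlib.h>

      /* Bind the current worker thread to node cap_no % n_nodes and
       * give it a node-local nursery. */
      static void *setup_capability(int cap_no, size_t nursery_bytes) {
          int n_nodes = numa_num_configured_nodes();
          int node    = cap_no % n_nodes;

          numa_run_on_node(node);                        /* bind thread to node */
          return numa_alloc_onnode(nursery_bytes, node); /* node-local memory   */
      }

      int main(void) {
          if (numa_available() < 0) {
              fprintf(stderr, "no NUMA support on this system\n");
              return 1;
          }
          size_t nursery_bytes = (size_t)64 << 20;       /* e.g. -A64m */
          void *nursery = setup_capability(0, nursery_bytes);
          /* ... run the mutator ... */
          numa_free(nursery, nursery_bytes);
          return 0;
      }
      ```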
      
      For example, using nofib/parallel/queens on a 24-core 2-socket machine:
      
      ```
      $ ./Main 15 +RTS -N24 -s -A64m
        Total   time  173.960s  (  7.467s elapsed)
      
      $ ./Main 15 +RTS -N24 -s -A64m --numa
        Total   time  150.836s  (  6.423s elapsed)
      ```
      
      The biggest win here is expected to be allocating from node-local
      memory, so that means programs using a large -A value (as here).
      
      According to perf, on this program the number of remote memory accesses
      were reduced by more than 50% by using `--numa`.
      
      Test Plan:
      * validate
      * There's a new flag --debug-numa=<n> that pretends to do NUMA without
        actually making the OS calls, which is useful for testing the code
        on non-NUMA systems.
      * TODO: I need to add some unit tests
      
      Reviewers: erikd, austin, rwbarton, ezyang, bgamari, hvr, niteria
      
      Subscribers: thomie
      
      Differential Revision: https://phabricator.haskell.org/D2199
  9. 04 May, 2016 1 commit
  10. 12 Apr, 2016 1 commit
    • Allocate blocks in the GC in batches · f4446c5b
      Simon Marlow authored
      Avoids contention for the block allocator lock in the GC; this can be
      seen in the gc_alloc_block_sync counter emitted by +RTS -s.
      
      I experimented with this a while ago, and there was already
      commented-out code for it in GCUtils.c, but I've now improved it so that
      it doesn't result in significantly worse memory usage.
      
       * The old method of putting spare blocks on ws->part_list was wasteful;
         the spare blocks are now shared between all generations and retained
         between GCs.
      
       * repeated allocGroup() results in fragmentation, so I switched to using
         allocLargeChunk() instead, which is fragmentation-friendly; we already
         use it for the same reason in nursery allocation. (A standalone sketch
         of the batching idea follows.)
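
      The contention-avoidance idea, reduced to a standalone toy (one shared
      free list behind a lock, with each GC thread refilling a private cache a
      batch at a time instead of taking the lock once per block); none of
      these names are GHC's:

      ```
      #include <pthread.h>
      #include <stddef.h>

      typedef struct block { struct block *link; } block;

      static block          *free_list;     /* shared between all GC threads */
      static pthread_mutex_t free_list_lock = PTHREAD_MUTEX_INITIALIZER;

      enum { BATCH = 16 };                  /* blocks taken per lock acquisition */

      typedef struct { block *cache; } gc_thread;   /* per-thread spare blocks */

      /* Take one block, refilling the thread's private cache in batches so
       * the shared lock is taken roughly 1/BATCH as often. */
      static block *gc_get_block(gc_thread *t) {
          if (t->cache == NULL) {
              pthread_mutex_lock(&free_list_lock);
              for (int i = 0; i < BATCH && free_list != NULL; i++) {
                  block *b  = free_list;
                  free_list = b->link;
                  b->link   = t->cache;
                  t->cache  = b;
              }
              pthread_mutex_unlock(&free_list_lock);
          }
          block *b = t->cache;
          if (b != NULL) t->cache = b->link;
          return b;                         /* NULL if the free list ran dry */
      }
      ```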
  11. 20 Aug, 2014 1 commit
  12. 13 Mar, 2014 1 commit
  13. 19 Jan, 2014 1 commit
  14. 25 Oct, 2013 1 commit
    • Make integer overflow less likely to happen (#7762) · 36b042fb
      Simon Marlow authored
      The particular problematic code in #7762 was this:
      
                  nat newSize = size - n;
                  char *freeAddr = MBLOCK_ROUND_DOWN(bd->start);
                  freeAddr += newSize * MBLOCK_SIZE;
                              ^^^^^^^^^^^^^^^^^^^^^^  OVERFLOW!!!
      
      For good measure, I'm going to fix the bug twice.  This patch fixes
      the class of bugs of this kind, by making sure that any expressions
      involving BLOCK_SIZE or MBLOCK_SIZE are promoted to unsigned long.  In
      a separate patch, I'll fix a bunch of individual instances (including
      the one above).
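
      A minimal sketch of the class of fix being described, using the same
      shape of code as the fragment above (W_ here stands for an unsigned,
      word-sized type; the constant is illustrative):

      ```
      #include <stdint.h>

      typedef uintptr_t W_;                 /* unsigned, pointer-sized word */

      /* Making the size constant a word-sized unsigned value forces the whole
       * expression to word width, so newSize * MBLOCK_SIZE can no longer
       * overflow a 32-bit nat. */
      #define MBLOCK_SIZE ((W_)1 << 20)

      static char *free_addr(char *mblock_base, W_ size, W_ n) {
          W_ newSize = size - n;
          return mblock_base + newSize * MBLOCK_SIZE;   /* arithmetic in W_ */
      }
      ```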
  15. 13 Nov, 2012 1 commit
  16. 07 Sep, 2012 1 commit
  17. 02 Feb, 2011 1 commit
    • A small GC optimisation · bef3da1e
      Simon Marlow authored
      Store the *number* of the destination generation in the Bdescr struct,
      so that in evacuate() we don't have to deref gen to get it.
      This is another improvement ported over from my GC branch.
  18. 10 Dec, 2010 1 commit
  19. 18 Jul, 2010 1 commit
    • Don't check for swept blocks in -DS. · eff182c3
      marcotmarcot authored
      The checkHeap function assumed the allocated part of the block contained only
      live objects and slop.  This was not true for blocks that are collected using
      mark sweep.  The code in this patch skips the test for this kind of block.
  20. 01 Apr, 2010 1 commit
    • Change the representation of the MVar blocked queue · f4692220
      Simon Marlow authored
      The list of threads blocked on an MVar is now represented as a list of
      separately allocated objects rather than being linked through the TSOs
      themselves.  This lets us remove a TSO from the list in O(1) time
      rather than O(n) time, by marking the list object.  Removing this
      linear component fixes some pathological performance cases where many
      threads were blocked on an MVar and became unreachable simultaneously
      (nofib/smp/threads007), or when sending an asynchronous exception to a
      TSO in a long list of threads blocked on an MVar.
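
      A rough standalone model of the new representation (separately
      allocated queue cells, so a blocked thread can be removed in O(1) by
      just marking its cell); the names are illustrative, not the RTS's:

      ```
      #include <stdbool.h>
      #include <stdlib.h>

      typedef struct tso tso;               /* stand-in for a thread (TSO) */

      /* One separately allocated cell per blocked thread.  Removal only
       * marks the cell; it is skipped and reclaimed on the next walk. */
      typedef struct mvar_q {
          struct mvar_q *next;
          tso           *thread;
          bool           dead;
      } mvar_q;

      typedef struct { mvar_q *head, **tail; } mvar_queue;

      static void queue_init(mvar_queue *q) { q->head = NULL; q->tail = &q->head; }

      static void block_on_mvar(mvar_queue *q, tso *t) {
          mvar_q *cell = malloc(sizeof *cell);
          if (cell == NULL) abort();
          cell->next = NULL; cell->thread = t; cell->dead = false;
          *q->tail = cell;
          q->tail  = &cell->next;
      }

      /* O(1): no traversal, unlike unlinking a TSO threaded through the queue. */
      static void unblock(mvar_q *cell) { cell->dead = true; }

      /* Wake-up side: pop cells, skipping the dead ones. */
      static tso *wakeup_first(mvar_queue *q) {
          while (q->head != NULL) {
              mvar_q *cell = q->head;
              q->head = cell->next;
              if (q->head == NULL) q->tail = &q->head;
              tso *t = cell->dead ? NULL : cell->thread;
              free(cell);
              if (t != NULL) return t;
          }
          return NULL;
      }
      ```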
      
      MVar performance has actually improved by a few percent as a result of
      this change, slightly to my surprise.
      
      This is the final cleanup in the sequence, which let me remove the old
      way of waking up threads (unblockOne(), MSG_WAKEUP) in favour of the
      new way (tryWakeupThread and MSG_TRY_WAKEUP, which is idempotent).  It
      is now the case that only the Capability that owns a TSO may modify
      its state (well, almost), and this simplifies various things.  More of
      the RTS is based on message-passing between Capabilities now.
  21. 03 Dec, 2009 1 commit
    • GC refactoring, remove "steps" · 214b3663
      Simon Marlow authored
      The GC had a two-level structure, G generations each of T steps.
      Steps are for aging within a generation, mostly to avoid premature
      promotion.  
      
      Measurements show that more than 2 steps is almost never worthwhile,
      and 1 step is usually worse than 2.  In theory fractional steps are
      possible, so the ideal number of steps is somewhere between 1 and 3.
      GHC's default has always been 2.
      
      We can implement 2 steps quite straightforwardly by having each block
      point to the generation to which objects in that block should be
      promoted, so blocks in the nursery point to generation 0, and blocks
      in gen 0 point to gen 1, and so on.
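
      Reduced to a toy, the per-block destination pointer looks something
      like this (all names here are illustrative):

      ```
      typedef struct generation generation;

      struct generation {
          int         no;
          generation *to;        /* where live objects in this generation are promoted */
      };

      /* Per-block descriptor: each block records the generation its live
       * objects should be evacuated to, so nursery blocks point at gen 0,
       * gen 0's blocks point at gen 1, and so on. */
      typedef struct {
          generation *dest;
      } block_descr;

      static generation *evacuate_target(const block_descr *bd) {
          return bd->dest;       /* no per-object step lookup needed */
      }
      ```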
      
      This commit removes the explicit step structures, merging generations
      with steps, thus simplifying a lot of code.  Performance is
      unaffected.  The tunable number of steps is now gone, although it may
      be replaced in the future by a way to tune the aging in generation 0.
  22. 29 Nov, 2009 1 commit
    • Store a destination step in the block descriptor · f9d15f9f
      Simon Marlow authored
      At the moment, this just saves a memory reference in the GC inner loop
      (worth a percent or two of GC time).  Later, it will hopefully let me
      experiment with partial steps, and simplifying the generation/step
      infrastructure.
  23. 02 Aug, 2009 1 commit
    • RTS tidyup sweep, first phase · a2a67cd5
      Simon Marlow authored
      The first phase of this tidyup is focussed on the header files, and in
      particular making sure we are exposing publicly exactly what we need
      to, and no more.
      
       - Rts.h now includes everything that the RTS exposes publicly,
         rather than a random subset of it.
      
       - Most of the public header files have moved into subdirectories, and
         many of them have been renamed.  But clients should not need to
         include any of the other headers directly, just #include the main
         public headers: Rts.h, HsFFI.h, RtsAPI.h.
      
       - All the headers needed for via-C compilation have moved into the
         stg subdirectory, which is self-contained.  Most of the headers for
         the rest of the RTS APIs have moved into the rts subdirectory.
      
       - I left MachDeps.h where it is, because it is so widely used in
         Haskell code.
       
       - I left a deprecated stub for RtsFlags.h in place.  The flag
         structures are now exposed by Rts.h.
      
       - Various internal APIs are no longer exposed by public header files.
      
       - Various bits of dead code and declarations have been removed
      
       - More gcc warnings are turned on, and the RTS code is more
         warning-clean.
      
       - More source files #include "PosixSource.h", and hence only use
         standard POSIX (1003.1c-1995) interfaces.
      
      There is a lot more tidying up still to do, this is just the first
      pass.  I also intend to standardise the names for external RTS APIs
      (e.g use the rts_ prefix consistently), and declare the internal APIs
      as hidden for shared libraries.
  24. 29 May, 2009 1 commit
  25. 09 Sep, 2008 1 commit
  26. 09 Jun, 2008 1 commit
  27. 16 Apr, 2008 1 commit
  28. 28 Feb, 2008 1 commit
  29. 20 Feb, 2008 1 commit
  30. 14 Dec, 2006 1 commit
    • Rework the block allocator · 485b8d1a
      Simon Marlow authored
      The main goal here is to reduce fragmentation, which turns out to be
      the cause of #743.  While I was here I found some opportunities to
      improve performance too.  The code is rather more complex, but it also
      contains a long comment describing the strategy, so please take a look
      at that for the details.
  31. 30 May, 2006 1 commit
    • replace stgMallocBytesRWX() with our own allocator · e3c55aeb
      Simon Marlow authored
      See bug #738
      
      Allocating executable memory is getting more difficult these days.  In
      particular, the default SELinux policy on Fedora Core 5 disallows
      making the heap (i.e. malloc()'d memory) executable, although it does
      apparently allow mmap()'ing anonymous executable memory by default.
      
      Previously, stgMallocBytesRWX() used malloc() underneath, and then
      tried to make the page holding the memory executable.  This was rather
      hacky and fails with Fedora Core 5.  
      
      This patch adds a mini-allocator for executable memory, based on the
      block allocator.  We grab page-sized blocks and make them executable,
      then allocate small objects from the page.  There's a simple free
      function, that will free whole pages back to the system when they are
      empty.
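
      A standalone sketch of that scheme on POSIX (mmap an anonymous
      executable page, bump-allocate small objects out of it, and unmap a
      page once its allocation count drops to zero); this illustrates the
      approach only and is not the RTS's code:

      ```
      #include <stdint.h>
      #include <stddef.h>
      #include <sys/mman.h>
      #include <unistd.h>

      /* A small header lives at the start of each executable page. */
      typedef struct exec_page {
          size_t used;           /* bump-pointer offset into the page   */
          size_t live;           /* number of objects still allocated   */
      } exec_page;

      static exec_page *current;

      static void *exec_alloc(size_t bytes) {
          size_t page = (size_t)sysconf(_SC_PAGESIZE);
          bytes = (bytes + 7) & ~(size_t)7;                  /* align to 8 */
          if (bytes > page - sizeof(exec_page)) return NULL; /* small objects only */
          if (current == NULL || current->used + bytes > page) {
              void *p = mmap(NULL, page, PROT_READ | PROT_WRITE | PROT_EXEC,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
              if (p == MAP_FAILED) return NULL;
              current = p;
              current->used = sizeof(exec_page);
              current->live = 0;
          }
          void *obj = (char *)current + current->used;
          current->used += bytes;
          current->live++;
          return obj;
      }

      /* Find the owning page by rounding down (mmap returns page-aligned
       * memory) and give the whole page back once it is empty. */
      static void exec_free(void *obj) {
          size_t page = (size_t)sysconf(_SC_PAGESIZE);
          exec_page *pg = (exec_page *)((uintptr_t)obj & ~(uintptr_t)(page - 1));
          if (--pg->live == 0 && pg != current)
              munmap(pg, page);
      }
      ```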
  32. 07 Apr, 2006 1 commit
    • Reorganisation of the source tree · 0065d5ab
      Simon Marlow authored
      Most of the other users of the fptools build system have migrated to
      Cabal, and with the move to darcs we can now flatten the source tree
      without losing history, so here goes.
      
      The main change is that the ghc/ subdir is gone, and most of what it
      contained is now at the top level.  The build system now makes no
      pretense at being multi-project, it is just the GHC build system.
      
      No doubt this will break many things, and there will be a period of
      instability while we fix the dependencies.  A straightforward build
      should work, but I haven't yet fixed binary/source distributions.
      Changes to the Building Guide will follow, too.
  33. 21 Oct, 2005 1 commit
    • [project @ 2005-10-21 14:02:17 by simonmar] · 03a9ff01
      simonmar authored
      Big re-hash of the threaded/SMP runtime
      
      This is a significant reworking of the threaded and SMP parts of
      the runtime.  There are two overall goals here:
      
        - To push down the scheduler lock, reducing contention and allowing
          more parts of the system to run without locks.  In particular,
          the scheduler does not require a lock any more in the common case.
      
        - To improve affinity, so that running Haskell threads stick to the
          same OS threads as much as possible.
      
      At this point we have the basic structure working, but there are some
      pieces missing.  I believe it's reasonably stable - the important
      parts of the testsuite pass in all the (normal,threaded,SMP) ways.
      
      In more detail:
      
        - Each capability now has a run queue, instead of one global run
          queue.  The Capability and Task APIs have been completely
          rewritten; see Capability.h and Task.h for the details.
      
        - Each capability has its own pool of worker Tasks.  Hence, Haskell
          threads on a Capability's run queue will run on the same worker
          Task(s).  As long as the OS is doing something reasonable, this
          should mean they usually stick to the same CPU.  Another way to
          look at this is that we're assuming each Capability is associated
          with a fixed CPU.
      
        - What used to be StgMainThread is now part of the Task structure.
          Every OS thread in the runtime has an associated Task, and it
          can ask for its current Task at any time with myTask().
      
        - removed RTS_SUPPORTS_THREADS symbol, use THREADED_RTS instead
          (it is now defined for SMP too).
      
        - The RtsAPI has had to change; we must explicitly pass a Capability
          around now.  The previous interface assumed some global state.
          SchedAPI has also changed a lot.
      
        - The OSThreads API now supports thread-local storage, used to
          implement myTask(), although it could be done more efficiently
          using gcc's __thread extension when available.
      
        - I've moved some POSIX-specific stuff into the posix subdirectory,
          moving in the direction of separating out platform-specific
          implementations.
      
        - lots of lock-debugging and assertions in the runtime.  In particular,
          when DEBUG is on, we catch multiple ACQUIRE_LOCK()s, and there is
          also an ASSERT_LOCK_HELD() call.
      
      What's missing so far:
      
        - I have almost certainly broken the Win32 build, will fix soon.
      
        - any kind of thread migration or load balancing.  This is high up
          the agenda, though.
      
        - various performance tweaks to do
      
        - throwTo and forkProcess still do not work in SMP mode
  34. 13 Jun, 2005 1 commit
    • [project @ 2005-06-13 12:29:48 by simonmar] · b07f3876
      simonmar authored
      Block allocator performance fix: instead of keeping the free list
      ordered, keep it doubly-linked, and introduce a new flag BF_FREE so we
      can tell when a block is free.  We can still coalesce blocks on the
      free list because block descriptors are kept consecutively in memory,
      so we can tell based on the BF_FREE flag whether to coalesce with the
      next higher/lower blocks when freeing a block.
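
      As a rough illustration of the coalescing trick (descriptors for
      adjacent block groups sit next to each other in memory, so a flag on
      the neighbouring descriptor is enough to decide whether to merge); all
      names below are stand-ins:

      ```
      #include <stdbool.h>
      #include <stddef.h>

      #define BF_FREE 0x1

      /* One descriptor per block, and descriptors for adjacent blocks are
       * themselves adjacent, so bd + bd->nblocks is the descriptor of the
       * next block group in address order. */
      typedef struct bdescr {
          struct bdescr *prev, *next;   /* doubly-linked free list        */
          size_t         nblocks;       /* size of this group in blocks   */
          unsigned       flags;
      } bdescr;

      static bool is_free(const bdescr *bd) { return (bd->flags & BF_FREE) != 0; }

      /* When freeing bd, absorb the group that follows it in memory if that
       * group is already on the free list.  (Updating the free-list head
       * pointer when the follower is the head is omitted for brevity.) */
      static void coalesce_with_next(bdescr *bd, bdescr *descr_end) {
          bdescr *follower = bd + bd->nblocks;
          if (follower < descr_end && is_free(follower)) {
              if (follower->prev) follower->prev->next = follower->next;
              if (follower->next) follower->next->prev = follower->prev;
              follower->flags &= ~BF_FREE;
              bd->nblocks += follower->nblocks;
          }
      }
      ```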
      
      This (almost) makes freeChain O(n) rather than O(n^2), and has been
      reported to help a lot when dealing with very large heaps.
  35. 12 Apr, 2005 1 commit
    • [project @ 2005-04-12 09:04:23 by simonmar] · 693550d9
      simonmar authored
      Per-task nurseries for SMP.  This was kind-of implemented before, but
      it's much cleaner now.  There is now one *step* per capability, so we
      have somewhere to hang the block count.  So for SMP, there are simply
      multiple instances of generation 0 step 0.  The rNursery entry in the
      register table now points to the step rather than the head block of
      the nursery.
  36. 27 Mar, 2005 1 commit
    • [project @ 2005-03-27 13:41:13 by panne] · 03dc2dd3
      panne authored
      * Some preprocessors don't like the C99/C++ '//' comments after a
        directive, so use '/* */' instead. For consistency, a lot of '//' in
        the include files were converted, too.
      
      * UnDOSified libraries/base/cbits/runProcess.c.
      
      * My favourite sport: Killed $Id$s.
  37. 10 Feb, 2005 1 commit
    • [project @ 2005-02-10 13:01:52 by simonmar] · e7c3f957
      simonmar authored
      GC changes: instead of threading old-generation mutable lists
      through objects in the heap, keep them in a separate flat array.
      (A standalone sketch of the idea follows the list of advantages.)
      
      This has some advantages:
      
        - the IND_OLDGEN object is now only 2 words, so the minimum
          size of a THUNK is now 2 words instead of 3.  This saves
          some amount of allocation (about 2% on average according to
          my measurements), and is more friendly to the cache by
          squashing objects together more.
      
        - keeping the mutable list separate from the IND object
          will be necessary for our multiprocessor implementation.
      
        - removing the mut_link field makes the layout of some objects
          more uniform, leading to less complexity and special cases.
      
        - I also unified the two mutable lists (mut_once_list and mut_list)
          into a single mutable list, which lead to more simplifications
          in the GC.
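
      A standalone sketch of the flat-array representation (a growable
      per-generation array of pointers to mutated old-generation objects,
      pushed to by the write barrier and walked by the GC); the names are
      illustrative:

      ```
      #include <stddef.h>
      #include <stdlib.h>

      typedef struct closure closure;     /* stand-in for a heap object */

      /* Per-generation mutable list: a flat, growable array rather than a
       * mut_link field threaded through the objects themselves. */
      typedef struct {
          closure **items;
          size_t    len, cap;
      } mut_list;

      /* Called from the write barrier when an old-generation object is mutated. */
      static void record_mutable(mut_list *ml, closure *c) {
          if (ml->len == ml->cap) {
              size_t cap = ml->cap ? ml->cap * 2 : 256;
              closure **p = realloc(ml->items, cap * sizeof *p);
              if (p == NULL) abort();
              ml->items = p; ml->cap = cap;
          }
          ml->items[ml->len++] = c;
      }

      /* The GC walks the array and then empties it. */
      static void scavenge_mut_list(mut_list *ml, void (*scavenge)(closure *)) {
          for (size_t i = 0; i < ml->len; i++)
              scavenge(ml->items[i]);
          ml->len = 0;
      }
      ```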
  38. 13 Aug, 2004 1 commit