1. 25 Jan, 2020 1 commit
    • PHO's avatar
      Fix rts allocateExec() on NetBSD · 8b726534
      PHO authored and Marge Bot's avatar Marge Bot committed
      Similar to SELinux, NetBSD "PaX mprotect" prohibits marking a page
      mapping both writable and executable at the same time. Use libffi
      which knows how to work around it.
      8b726534
  2. 09 Dec, 2019 1 commit
    • Gabor Greif's avatar
      Fix comment typos · d46a72e1
      Gabor Greif authored and Ben Gamari's avatar Ben Gamari committed
      The below is only necessary to fix the CI perf fluke that
      happened in 9897e8c8:
      -------------------------
      Metric Decrease:
          T5837
          T6048
          T9020
          T12425
          T12234
          T13035
          T12150
          Naperian
      -------------------------
      d46a72e1
  3. 02 Dec, 2019 1 commit
  4. 22 Oct, 2019 1 commit
  5. 21 Oct, 2019 4 commits
    • Ben Gamari's avatar
    • Ben Gamari's avatar
      Don't cleanup until we've stopped the collector · 10373416
      Ben Gamari authored
      This requires that we break nonmovingExit into two pieces since we need
      to first stop the collector to relinquish any capabilities, then we need
      to shutdown the scheduler, then we need to free the nonmoving
      allocators.
      10373416
    • Ben Gamari's avatar
      rts: Implement concurrent collection in the nonmoving collector · bd8e3ff4
      Ben Gamari authored and Ben Gamari's avatar Ben Gamari committed
      
      
      This extends the non-moving collector to allow concurrent collection.
      
      The full design of the collector implemented here is described in detail
      in a technical note
      
          B. Gamari. "A Concurrent Garbage Collector For the Glasgow Haskell
          Compiler" (2018)
      
      This extension involves the introduction of a capability-local
      remembered set, known as the /update remembered set/, which tracks
      objects which may no longer be visible to the collector due to mutation.
      To maintain this remembered set we introduce a write barrier on
      mutations which is enabled while a concurrent mark is underway.
      
      The update remembered set representation is similar to that of the
      nonmoving mark queue, being a chunked array of `MarkEntry`s. Each
      `Capability` maintains a single accumulator chunk, which it flushed
      when it (a) is filled, or (b) when the nonmoving collector enters its
      post-mark synchronization phase.
      
      While the write barrier touches a significant amount of code it is
      conceptually straightforward: the mutator must ensure that the referee
      of any pointer it overwrites is added to the update remembered set.
      However, there are a few details:
      
       * In the case of objects with a dirty flag (e.g. `MVar`s) we can
         exploit the fact that only the *first* mutation requires a write
         barrier.
      
       * Weak references, as usual, complicate things. In particular, we must
         ensure that the referee of a weak object is marked if dereferenced by
         the mutator. For this we (unfortunately) must introduce a read
         barrier, as described in Note [Concurrent read barrier on deRefWeak#]
         (in `NonMovingMark.c`).
      
       * Stable names are also a bit tricky as described in Note [Sweeping
         stable names in the concurrent collector] (`NonMovingSweep.c`).
      
      We take quite some pains to ensure that the high thread count often seen
      in parallel Haskell applications doesn't affect pause times. To this end
      we allow thread stacks to be marked either by the thread itself (when it
      is executed or stack-underflows) or the concurrent mark thread (if the
      thread owning the stack is never scheduled). There is a non-trivial
      handshake to ensure that this happens without racing which is described
      in Note [StgStack dirtiness flags and concurrent marking].
      Co-Authored-by: Ömer Sinan Ağacan's avatarÖmer Sinan Ağacan <omer@well-typed.com>
      bd8e3ff4
    • Ömer Sinan Ağacan's avatar
      rts: Non-concurrent mark and sweep · 68e0647f
      Ömer Sinan Ağacan authored and Ben Gamari's avatar Ben Gamari committed
      
      
      This implements the core heap structure and a serial mark/sweep
      collector which can be used to manage the oldest-generation heap.
      This is the first step towards a concurrent mark-and-sweep collector
      aimed at low-latency applications.
      
      The full design of the collector implemented here is described in detail
      in a technical note
      
          B. Gamari. "A Concurrent Garbage Collector For the Glasgow Haskell
          Compiler" (2018)
      
      The basic heap structure used in this design is heavily inspired by
      
          K. Ueno & A. Ohori. "A fully concurrent garbage collector for
          functional programs on multicore processors." /ACM SIGPLAN Notices/
          Vol. 51. No. 9 (presented by ICFP 2016)
      
      This design is intended to allow both marking and sweeping
      concurrent to execution of a multi-core mutator. Unlike the Ueno design,
      which requires no global synchronization pauses, the collector
      introduced here requires a stop-the-world pause at the beginning and end
      of the mark phase.
      
      To avoid heap fragmentation, the allocator consists of a number of
      fixed-size /sub-allocators/. Each of these sub-allocators allocators into
      its own set of /segments/, themselves allocated from the block
      allocator. Each segment is broken into a set of fixed-size allocation
      blocks (which back allocations) in addition to a bitmap (used to track
      the liveness of blocks) and some additional metadata (used also used
      to track liveness).
      
      This heap structure enables collection via mark-and-sweep, which can be
      performed concurrently via a snapshot-at-the-beginning scheme (although
      concurrent collection is not implemented in this patch).
      
      The mark queue is a fairly straightforward chunked-array structure.
      The representation is a bit more verbose than a typical mark queue to
      accomodate a combination of two features:
      
       * a mark FIFO, which improves the locality of marking, reducing one of
         the major overheads seen in mark/sweep allocators (see [1] for
         details)
      
       * the selector optimization and indirection shortcutting, which
         requires that we track where we found each reference to an object
         in case we need to update the reference at a later point (e.g. when
         we find that it is an indirection). See Note [Origin references in
         the nonmoving collector] (in `NonMovingMark.h`) for details.
      
      Beyond this the mark/sweep is fairly run-of-the-mill.
      
      [1] R. Garner, S.M. Blackburn, D. Frampton. "Effective Prefetch for
          Mark-Sweep Garbage Collection." ISMM 2007.
      Co-Authored-By: Ben Gamari's avatarBen Gamari <ben@well-typed.com>
      68e0647f
  6. 18 Oct, 2019 1 commit
  7. 12 Oct, 2019 1 commit
    • John Ericson's avatar
      Simplify Configure in a few ways · c2290596
      John Ericson authored and Marge Bot's avatar Marge Bot committed
       - No need to distinguish between gcc-llvm and clang. First of all,
         gcc-llvm is quite old and surely unmaintained by now. Second of all,
         none of the code actually care about that distinction!
      
         Now, it does make sense to consider C multiple frontends for LLVMs in
         the form of clang vs clang-cl (same clang, yes, but tweaked
         interface). But this is better handled in terms of "gccish vs
         mvscish" and "is LLVM", yielding 4 combinations. Therefore, I don't
         think it is useful saving the existing code for that.
      
       - Get the remaining CC_LLVM_BACKEND, and also TABLES_NEXT_TO_CODE in
         mk/config.h the normal way, rather than hacking it post-hoc. No point
         keeping these special cases around for now reason.
      
       - Get rid of hand-rolled `die` function and just use `AC_MSG_ERROR`.
      
       - Abstract check + flag override for unregisterised and tables next to
         code.
      
      Oh, and as part of the above I also renamed/combined some variables
      where it felt appropriate.
      
       - GccIsClang -> CcLlvmBackend. This is for `AC_SUBST`, like the other
       Camal case ones. It was never about gcc-llvm, or Apple's renamed clang,
       to be clear.
      
       - llvm_CC_FLAVOR -> CC_LLVM_BACKEND. This is for `AC_DEFINE`, like the
       other all-caps snake case ones. llvm_CC_FLAVOR was just silly
       indirection *and* an odd name to boot.
      c2290596
  8. 03 Oct, 2019 1 commit
  9. 28 Jun, 2019 2 commits
    • Travis Whitaker's avatar
      Correct closure observation, construction, and mutation on weak memory machines. · 11bac115
      Travis Whitaker authored and Ben Gamari's avatar Ben Gamari committed
      
      
      Here the following changes are introduced:
          - A read barrier machine op is added to Cmm.
          - The order in which a closure's fields are read and written is changed.
          - Memory barriers are added to RTS code to ensure correctness on
            out-or-order machines with weak memory ordering.
      
      Cmm has a new CallishMachOp called MO_ReadBarrier. On weak memory machines, this
      is lowered to an instruction that ensures memory reads that occur after said
      instruction in program order are not performed before reads coming before said
      instruction in program order. On machines with strong memory ordering properties
      (e.g. X86, SPARC in TSO mode) no such instruction is necessary, so
      MO_ReadBarrier is simply erased. However, such an instruction is necessary on
      weakly ordered machines, e.g. ARM and PowerPC.
      
      Weam memory ordering has consequences for how closures are observed and mutated.
      For example, consider a closure that needs to be updated to an indirection. In
      order for the indirection to be safe for concurrent observers to enter, said
      observers must read the indirection's info table before they read the
      indirectee. Furthermore, the entering observer makes assumptions about the
      closure based on its info table contents, e.g. an INFO_TYPE of IND imples the
      closure has an indirectee pointer that is safe to follow.
      
      When a closure is updated with an indirection, both its info table and its
      indirectee must be written. With weak memory ordering, these two writes can be
      arbitrarily reordered, and perhaps even interleaved with other threads' reads
      and writes (in the absence of memory barrier instructions). Consider this
      example of a bad reordering:
      
      - An updater writes to a closure's info table (INFO_TYPE is now IND).
      - A concurrent observer branches upon reading the closure's INFO_TYPE as IND.
      - A concurrent observer reads the closure's indirectee and enters it. (!!!)
      - An updater writes the closure's indirectee.
      
      Here the update to the indirectee comes too late and the concurrent observer has
      jumped off into the abyss. Speculative execution can also cause us issues,
      consider:
      
      - An observer is about to case on a value in closure's info table.
      - The observer speculatively reads one or more of closure's fields.
      - An updater writes to closure's info table.
      - The observer takes a branch based on the new info table value, but with the
        old closure fields!
      - The updater writes to the closure's other fields, but its too late.
      
      Because of these effects, reads and writes to a closure's info table must be
      ordered carefully with respect to reads and writes to the closure's other
      fields, and memory barriers must be placed to ensure that reads and writes occur
      in program order. Specifically, updates to a closure must follow the following
      pattern:
      
      - Update the closure's (non-info table) fields.
      - Write barrier.
      - Update the closure's info table.
      
      Observing a closure's fields must follow the following pattern:
      
      - Read the closure's info pointer.
      - Read barrier.
      - Read the closure's (non-info table) fields.
      
      This patch updates RTS code to obey this pattern. This should fix long-standing
      SMP bugs on ARM (specifically newer aarch64 microarchitectures supporting
      out-of-order execution) and PowerPC. This fixes issue #15449.
      Co-Authored-By: Ben Gamari's avatarBen Gamari <ben@well-typed.com>
      11bac115
    • Sylvain Henry's avatar
      Fix GCC warnings with __clear_cache builtin (#16867) · 4ec233ec
      Sylvain Henry authored and Marge Bot's avatar Marge Bot committed
      4ec233ec
  10. 02 Apr, 2019 1 commit
    • Michal Terepeta's avatar
      Improve performance of newSmallArray# · 7cf5ba3d
      Michal Terepeta authored and Marge Bot's avatar Marge Bot committed
      
      
      This:
      - Hoists part of the condition outside of the initialization loop in
        `stg_newSmallArrayzh`.
      - Annotates one of the unlikely branches as unlikely, also in
        `stg_newSmallArrayzh`.
      - Adds a couple of annotations to `allocateMightFail` indicating which
        branches are likely to be taken.
      
      Together this gives about 5% improvement.
      Signed-off-by: Michal Terepeta's avatarMichal Terepeta <michal.terepeta@gmail.com>
      7cf5ba3d
  11. 25 Mar, 2019 1 commit
    • Takenobu Tani's avatar
      Update Wiki URLs to point to GitLab · 3769e3a8
      Takenobu Tani authored and Marge Bot's avatar Marge Bot committed
      This moves all URL references to Trac Wiki to their corresponding
      GitLab counterparts.
      
      This substitution is classified as follows:
      
      1. Automated substitution using sed with Ben's mapping rule [1]
          Old: ghc.haskell.org/trac/ghc/wiki/XxxYyy...
          New: gitlab.haskell.org/ghc/ghc/wikis/xxx-yyy...
      
      2. Manual substitution for URLs containing `#` index
          Old: ghc.haskell.org/trac/ghc/wiki/XxxYyy...#Zzz
          New: gitlab.haskell.org/ghc/ghc/wikis/xxx-yyy...#zzz
      
      3. Manual substitution for strings starting with `Commentary`
          Old: Commentary/XxxYyy...
          New: commentary/xxx-yyy...
      
      See also !539
      
      [1]: https://gitlab.haskell.org/bgamari/gitlab-migration/blob/master/wiki-mapping.json
      3769e3a8
  12. 15 Mar, 2019 1 commit
  13. 27 Jun, 2018 1 commit
  14. 26 Jun, 2018 1 commit
  15. 02 May, 2018 1 commit
  16. 30 Mar, 2018 1 commit
  17. 21 Jan, 2018 1 commit
    • Douglas Wilson's avatar
      [rts] Adjust whitehole_spin · 180ca65f
      Douglas Wilson authored and Ben Gamari's avatar Ben Gamari committed
      Rename to whitehole_gc_spin, in preparation for adding stats for the
      whitehole busy-loop in SMPClosureOps.
      
      Make whitehole_gc_spin volatile, and move it to be defined and
      statically initialised in GC.c. This saves some #ifs, and I'm pretty
      sure it should be volatile.
      
      Test Plan: ./validate
      
      Reviewers: bgamari, erikd, simonmar
      
      Reviewed By: bgamari
      
      Subscribers: rwbarton, thomie, carter
      
      Differential Revision: https://phabricator.haskell.org/D4300
      180ca65f
  18. 26 Sep, 2017 2 commits
  19. 06 Jul, 2017 1 commit
  20. 05 Jul, 2017 3 commits
  21. 22 Jun, 2017 1 commit
    • Sergei Trofimovich's avatar
      UNREG: use __builtin___clear_cache where available · 34b7f63e
      Sergei Trofimovich authored
      
      
      Noticed when was building UNREG ghc with -optc{-Wall,-Werror}:
      
        rts/sm/Storage.c:1359:3: error:
           error: implicit declaration of function '__clear_cache'
             [-Werror=implicit-function-declaration]
             __clear_cache((void*)begin, (void*)end);
             ^~~~~~~~~~~~~
             |
        1359 |   __clear_cache((void*)begin, (void*)end);
             |   ^
      
      Left direct '__clear_cache' usage gcc toolchain before 4.4.
      Signed-off-by: default avatarSergei Trofimovich <slyfox@gentoo.org>
      34b7f63e
  22. 21 Jun, 2017 4 commits
  23. 29 Apr, 2017 1 commit
  24. 02 Apr, 2017 1 commit
    • Simon Marlow's avatar
      Report heap overflow in the same way as stack overflow · 61ba4518
      Simon Marlow authored and Ben Gamari's avatar Ben Gamari committed
      Now that we throw an exception for heap overflow, we should only print
      the heap overflow message in the main thread when the HeapOverflow
      exception is caught, rather than as a side effect in the GC.
      
      Stack overflows were already done this way, I just made heap overflow
      consistent with stack overflow, and did some related cleanup.
      
      Fixes broken T2592(profasm) which was reporting the heap overflow
      message twice (you would only notice when building with profiling
      libs enabled).
      
      Test Plan: validate
      
      Reviewers: bgamari, niteria, austin, DemiMarie, hvr, erikd
      
      Reviewed By: bgamari
      
      Subscribers: rwbarton, thomie
      
      Differential Revision: https://phabricator.haskell.org/D3394
      61ba4518
  25. 07 Dec, 2016 1 commit
    • Simon Marlow's avatar
      Overhaul of Compact Regions (#12455) · 7036fde9
      Simon Marlow authored
      Summary:
      This commit makes various improvements and addresses some issues with
      Compact Regions (aka Compact Normal Forms).
      
      This was the most important thing I wanted to fix.  Compaction
      previously prevented GC from running until it was complete, which
      would be a problem in a multicore setting.  Now, we compact using a
      hand-written Cmm routine that can be interrupted at any point.  When a
      GC is triggered during a sharing-enabled compaction, the GC has to
      traverse and update the hash table, so this hash table is now stored
      in the StgCompactNFData object.
      
      Previously, compaction consisted of a deepseq using the NFData class,
      followed by a traversal in C code to copy the data.  This is now done
      in a single pass with hand-written Cmm (see rts/Compact.cmm). We no
      longer use the NFData instances, instead the Cmm routine evaluates
      components directly as it compacts.
      
      The new compaction is about 50% faster than the old one with no
      sharing, and a little faster on average with sharing (the cost of the
      hash table dominates when we're doing sharing).
      
      Static objects that don't (transitively) refer to any CAFs don't need
      to be copied into the compact region.  In particular this means we
      often avoid copying Char values and small Int values, because these
      are static closures in the runtime.
      
      Each Compact# object can support a single compactAdd# operation at any
      given time, so the Data.Compact library now enforces mutual exclusion
      using an MVar stored in the Compact object.
      
      We now get exceptions rather than killing everything with a barf()
      when we encounter an object that cannot be compacted (a function, or a
      mutable object).  We now also detect pinned objects, which can't be
      compacted either.
      
      The Data.Compact API has been refactored and cleaned up.  A new
      compactSize operation returns the size (in bytes) of the compact
      object.
      
      Most of the documentation is in the Haddock docs for the compact
      library, which I've expanded and improved here.
      
      Various comments in the code have been improved, especially the main
      Note [Compact Normal Forms] in rts/sm/CNF.c.
      
      I've added a few tests, and expanded a few of the tests that were
      there.  We now also run the tests with GHCi, and in a new test way
      that enables sanity checking (+RTS -DS).
      
      There's a benchmark in libraries/compact/tests/compact_bench.hs for
      measuring compaction speed and comparing sharing vs. no sharing.
      
      The field totalDataW in StgCompactNFData was unnecessary.
      
      Test Plan:
      * new unit tests
      * validate
      * tested manually that we can compact Data.Aeson data
      
      Reviewers: gcampax, bgamari, ezyang, austin, niteria, hvr, erikd
      
      Subscribers: thomie, simonpj
      
      Differential Revision: https://phabricator.haskell.org/D2751
      
      GHC Trac Issues: #12455
      7036fde9
  26. 06 Dec, 2016 1 commit
    • Simon Marlow's avatar
      Overhaul GC stats · 24e6594c
      Simon Marlow authored
      Summary:
      Visible API changes:
      
      * The C struct `GCDetails` gives the stats about a single GC.  This is
        passed to the `gcDone()` callback if one is set via the
        RtsConfig. (previously we just passed a collection of values, so this
        is more extensible, at the expense of breaking the existing API)
      
      * `RTSStats` gives cumulative stats since the start of the program,
        and includes the `GCDetails` for the most recent GC.  This struct
        can be obtained via `getRTSStats()` (the old `getGCStats()` has been
        removed, and `getGCStatsEnabled()` has been renamed to
        `getRTSStatsEnabled()`)
      
      Improvements:
      
      * The per-GC stats and cumulative stats are now cleanly separated.
      
      * Inside the RTS we have a top-level `RTSStats` struct to keep all our
        stats in, previously this was just a collection of strangely-named
        variables.  This struct is mostly just copied in `getRTSStats()`, so
        the implementation of that function is a lot shorter.
      
      * Types are more consistent.  We use a uint64_t byte count for all
        memory values, and Time for all time values.
      
      * Names are more consistent.  We use a suffix `_bytes` for all byte
        counts and `_ns` for all time values.
      
      * We now collect information about the amount of memory in large
        objects and compact objects in `GCDetails`. (the latter was the reason
        I started doing this patch but it seems to have ballooned a bit!)
      
      * I fixed a bug in the calculation of the elapsed MUT time, and added
        an ASSERT to stop the calculations going wrong in the future.
      
      For now I kept the Haskell API in `GHC.Stats` the same, by
      impedence-matching with the new API.  We could either break that API
      and make it match the C API more closely, or we could add a new API
      and deprecate the old one.  Opinions welcome.
      
      This stuff is very easy to get wrong, and it's hard to test.  Reviews
      welcome!
      
      Test Plan:
      manual testing
      validate
      
      Reviewers: bgamari, niteria, austin, ezyang, hvr, erikd, rwbarton, Phyx
      
      Subscribers: thomie
      
      Differential Revision: https://phabricator.haskell.org/D2756
      24e6594c
  27. 29 Nov, 2016 1 commit
  28. 16 Nov, 2016 1 commit
  29. 09 Oct, 2016 1 commit
    • Simon Marlow's avatar
      Turn on -n4m with -A16m or greater · 85e81a85
      Simon Marlow authored and Ben Gamari's avatar Ben Gamari committed
      Nursery chunks help reduce the cost of GC when capabilities are unevenly
      loaded, by ensuring that we use more of the available nursery.
      
      The rationale for enabling this at -A16m is that any negative effects
      due to loss of cache locality are less likely to be an issue at -A16m
      and above.  It's a conservative guess.  If we had a lot of benchmark
      data we could probably do better.
      
      Results for nofib/parallel at -N4 -A32m with and without -n4m:
      
      ```
      ------------------------------------------------------------------------
              Program           Size    Allocs   Runtime   Elapsed  TotalMem
      ------------------------------------------------------------------------
         blackscholes           0.0%     -9.5%     -9.0%    -15.0%     -2.2%
                coins           0.0%     -4.7%     -3.6%     -0.6%    -13.6%
               mandel           0.0%     -0.3%     +7.7%    +13.1%     +0.1%
              matmult           0.0%     +1.5%    +10.0%     +7.7%     +0.1%
                nbody           0.0%     -4.1%     -2.9%     0.085      0.0%
               parfib           0.0%     -1.4%     +1.0%     +1.5%     +0.2%
              partree           0.0%     -0.3%     +0.8%     +2.9%     -0.8%
                 prsa           0.0%     -0.5%     -2.1%     -7.6%      0.0%
               queens           0.0%     -3.2%     -1.4%     +2.2%     +1.3%
                  ray           0.0%     -5.6%    -14.5%     -7.6%     +0.8%
             sumeuler           0.0%     -0.4%     +2.4%     +1.1%      0.0%
      ------------------------------------------------------------------------
                  Min           0.0%     -9.5%    -14.5%    -15.0%    -13.6%
                  Max           0.0%     +1.5%    +10.0%    +13.1%     +1.3%
       Geometric Mean          +0.0%     -2.6%     -1.3%     -0.5%     -1.4%
      ```
      
      Not conclusive, but slightly better.  This matters a lot more when you
      have more cores.
      
      Test Plan: validate, nofib/paralel
      
      Reviewers: niteria, ezyang, nh2, trofi, austin, erikd, bgamari
      
      Reviewed By: bgamari
      
      Subscribers: thomie
      
      Differential Revision: https://phabricator.haskell.org/D2581
      
      GHC Trac Issues: #9221
      85e81a85
  30. 16 Aug, 2016 1 commit