1. 08 Mar, 2018 1 commit
  2. 16 Nov, 2017 1 commit
    • Simon Marlow's avatar
      Detect overly long GC sync · 2f463873
      Simon Marlow authored
      Summary:
      GC sync is the time between a GC being intiated and all the mutator
      threads finally stopping so that the GC can start. Problems that cause
      the GC sync to be delayed are hard to find and can cause dramatic
      slowdowns for heavily parallel programs.
      
      The new flag --long-gc-sync=<time> helps by emitting a warning and
      calling a user-overridable hook when the GC sync time exceeds the
      specified threshold. A debugger can be used to set a breakpoint when
      this happens and inspect the stacks of threads to find the culprit.
      
      Test Plan:
      ```
      $ ./inplace/bin/ghc-stage2 +RTS --long-gc-sync=0.0000001 -S
          Alloc    Copied     Live     GC     GC      TOT      TOT  Page Flts
          bytes     bytes     bytes   user   elap     user     elap
        1135856     51144    153736  0.000  0.000    0.002    0.002    0    0  (Gen:  0)
        1034760     94704    188752  0.000  0.000    0.002    0.002    0    0  (Gen:  0)
        1038888    134832    228888  0.009  0.009    0.011    0.011    0    0  (Gen:  1)
        1025288     90128    235184  0.000  0.000    0.012    0.012    0    0  (Gen:  0)
        1049088    130080    333984  0.000  0.000    0.013    0.013    0    0  (Gen:  0)
      Warning: waited 0us for GC sync
        1034424     73360    331976  0.000  0.000    0.013    0.013    0    0  (Gen:  0)
      ```
      
      Also tested on a real production problem.
      
      Reviewers: niteria, bgamari, erikd
      
      Subscribers: rwbarton, thomie
      
      Differential Revision: https://phabricator.haskell.org/D4193
      2f463873
  3. 24 Oct, 2017 1 commit
  4. 22 Oct, 2017 1 commit
    • Tamar Christina's avatar
      Add stack traces on crashes on Windows · 99c61e22
      Tamar Christina authored
      Summary:
      This patch adds the ability to generate stack traces on crashes for Windows.
      When running in the interpreter this attempts to use symbol information from
      the interpreter and information we know about the loaded object files to
      resolve addresses to symbols.
      
      When running compiled it doesn't have this information and then defaults
      to using symbol information from PDB files. Which for now means only
      files compiled with ICC or MSVC will show traces compiled.
      
      But I have a future patch that may address this shortcoming.
      
      Also since I don't know how to walk a pure haskell stack, I can for now
      only show the last entry. I'm hoping to figure out how Apply.cmm works to
      be able to walk the stalk and give more entries for pure haskell code.
      
      In GHCi
      
      ```
      $ echo main | inplace/bin/ghc-stage2.exe --interactive ./testsuite/tests/rts/derefnull.hs
      GHCi, version 8.3.20170830: http://www.haskell.org/ghc/  :? for help
      Ok, 1 module loaded.
      Prelude Main>
      Access violation in generated code when reading 0x0
      
       Attempting to reconstruct a stack trace...
      
         Frame        Code address
       * 0x77cde10    0xc370229 E:\..\base\dist-install\build\HSbase-4.10.0.0.o+0x190031
                       (base_ForeignziStorable_zdfStorableInt4_info+0x3f)
      ```
      
      and compiled
      
      ```
      Access violation in generated code when reading 0x0
      
       Attempting to reconstruct a stack trace...
      
         Frame        Code address
       * 0xf0dbd0     0x40bb01 E:\..\rts\derefnull.run\derefnull.exe+0xbb01
      ```
      
      Test Plan: ./validate
      
      Reviewers: austin, hvr, bgamari, erikd, simonmar
      
      Reviewed By: bgamari
      
      Subscribers: rwbarton, thomie
      
      Differential Revision: https://phabricator.haskell.org/D3913
      99c61e22
  5. 03 Oct, 2017 1 commit
    • Tamar Christina's avatar
      Add ability to produce crash dumps on Windows · ec9ac20d
      Tamar Christina authored
      It's often hard to debug things like segfaults on Windows,
      mostly because gdb isn't always of use and users don't know
      how to effectively use it.
      
      This patch provides a way to create a crash drump by passing
      
      `+RTS --generate-crash-dumps` as an option. If any unhandled
      exception is triggered a dump is made that contains enough
      information to be able to diagnose things successfully.
      
      Currently the created dumps are a bit big because I include
      all registers, code and threads information.
      
      This looks like
      
      ```
      $ testsuite/tests/rts/derefnull.run/derefnull.exe +RTS
      --generate-crash-dumps
      
      Access violation in generated code when reading 0000000000000000
      Crash dump created. Dump written to:
              E:\msys64\tmp\ghc-20170901-220250-11216-16628.dmp
      ```
      
      Test Plan: ./validate
      
      Reviewers: austin, hvr, bgamari, erikd, simonmar
      
      Reviewed By: bgamari, simonmar
      
      Subscribers: rwbarton, thomie
      
      Differential Revision: https://phabricator.haskell.org/D3912
      ec9ac20d
  6. 26 Sep, 2017 3 commits
    • Tamar Christina's avatar
      Switch VEH to VCH and allow disabling of SEH completely. · 1421d87c
      Tamar Christina authored
      Exception handling on Windows is unfortunately a bit complicated.
      But essentially the VEH Handlers we currently have are running too
      early.
      
      This was a problem as it ran so early it also swallowed C++ exceptions
      and other software exceptions which the system could have very well
      recovered from.
      
      So instead we use a sequence of chains to for the exception handlers to
      run as late as possible. You really can't get any later than this.
      
      Please read the comment in the patch for more details.
      
      I'm also providing a switch to allow people to turn off the exception
      handling entirely. In case it does present a problem with their code.
      
      (Reverted and recommitted to fix authorship information)
      
      Test Plan: ./validate
      
      Reviewers: austin, hvr, bgamari, erikd, simonmar
      
      Reviewed By: bgamari
      
      Subscribers: rwbarton, thomie
      
      GHC Trac Issues: #13911, #12110
      
      Differential Revision: https://phabricator.haskell.org/D3911
      1421d87c
    • Ben Gamari's avatar
      Revert "Switch VEH to VCH and allow disabling of SEH completely." · 47888fd8
      Ben Gamari authored
      Reverting to fix authorship of commit.
      
      This reverts commit 1825cbdb.
      47888fd8
    • Ben Gamari's avatar
      Switch VEH to VCH and allow disabling of SEH completely. · 1825cbdb
      Ben Gamari authored
      Exception handling on Windows is unfortunately a bit complicated.
      But essentially the VEH Handlers we currently have are running too
      early.
      
      This was a problem as it ran so early it also swallowed C++ exceptions
      and other software exceptions which the system could have very well
      recovered from.
      
      So instead we use a sequence of chains to for the exception handlers to
      run as late as possible. You really can't get any later than this.
      
      Please read the comment in the patch for more details.
      
      I'm also providing a switch to allow people to turn off the exception
      handling entirely. In case it does present a problem with their code.
      
      Test Plan: ./validate
      
      Reviewers: austin, hvr, bgamari, erikd, simonmar
      
      Reviewed By: bgamari
      
      Subscribers: rwbarton, thomie
      
      GHC Trac Issues: #13911, #12110
      
      Differential Revision: https://phabricator.haskell.org/D3911
      1825cbdb
  7. 28 Jul, 2017 1 commit
    • Andreas Klebinger's avatar
      Add rtsopts ignore and ignoreAll. · d75bba85
      Andreas Klebinger authored
      These ignore commandline arguments for ignore and commandline as well as
      GHCRTS arguments for ignoreAll. Passing RTS flags given on the command
      line along to the program by simply skipping processing of these flags
      by the RTS.
      
      This fixes #12870.
      
      Test Plan: ./validate
      
      Reviewers: austin, hvr, bgamari, erikd, simonmar
      
      Reviewed By: simonmar
      
      Subscribers: Phyx, rwbarton, thomie
      
      GHC Trac Issues: #12870
      
      Differential Revision: https://phabricator.haskell.org/D3740
      d75bba85
  8. 24 Jul, 2017 2 commits
  9. 23 Jul, 2017 3 commits
    • Ben Gamari's avatar
      Fix more documentation wibbles · 2dff2c7f
      Ben Gamari authored
      Fixes #14020, #14016, #14015, #14019
      2dff2c7f
    • Ben Gamari's avatar
      users-guide: Fix various wibbles · c9451959
      Ben Gamari authored
      c9451959
    • patrickdoc's avatar
      users-guide: Standardize and repair all flag references · 44b090be
      patrickdoc authored
      This patch does three things:
      
      1.) It simplifies the flag parsing code in `conf.py` to properly display
      flag definitions created by `.. (ghc|rts)-flag::`. Additionally, all flag
      references must include the associated arguments. Documentation has been
      added to `editing-guide.rst` to explain this.
      
      2.) It normalizes all flag definitions to a similar format. Notably, all
      instances of `<>` have been replaced with `⟨⟩`. All references across the
      users guide have been updated to match.
      
      3.) It fixes a couple issues with the flag reference table's generation code,
      which did not handle comma separated flags in the same cell and did not
      properly reference flags with arguments.
      
      Test Plan:
      `SPHINXOPTS = -n` to activate "nitpicky" mode, which reports all broken
      references. All remaining errors are references to flags without any
      documentation.
      
      Reviewers: austin, bgamari
      
      Reviewed By: bgamari
      
      Subscribers: rwbarton, thomie
      
      GHC Trac Issues: #13980
      
      Differential Revision: https://phabricator.haskell.org/D3778
      44b090be
  10. 31 Jan, 2017 1 commit
    • alexbiehl's avatar
      Abstract over the way eventlogs are flushed · 4dfc6d1c
      alexbiehl authored
      Currently eventlog data is always written to a file `progname.eventlog`.
      This patch introduces the `flushEventLog` field in `RtsConfig` which
      allows to customize the writing of eventlog data.
      
      One possible scenario is the ongoing live-profile-monitor effort by
      @NCrashed which slurps all eventlog data through `fluchEventLog`.
      
      `flushEventLog` takes a buffer with eventlog data and its size and
      returns `false` (0) in case eventlog data could not be procesed.
      
      Reviewers: simonmar, austin, erikd, bgamari
      
      Reviewed By: simonmar, bgamari
      
      Subscribers: qnikst, thomie, NCrashed
      
      Differential Revision: https://phabricator.haskell.org/D2934
      4dfc6d1c
  11. 10 Jan, 2017 1 commit
  12. 09 Oct, 2016 2 commits
    • Simon Marlow's avatar
      Turn on -n4m with -A16m or greater · 85e81a85
      Simon Marlow authored
      Nursery chunks help reduce the cost of GC when capabilities are unevenly
      loaded, by ensuring that we use more of the available nursery.
      
      The rationale for enabling this at -A16m is that any negative effects
      due to loss of cache locality are less likely to be an issue at -A16m
      and above.  It's a conservative guess.  If we had a lot of benchmark
      data we could probably do better.
      
      Results for nofib/parallel at -N4 -A32m with and without -n4m:
      
      ```
      ------------------------------------------------------------------------
              Program           Size    Allocs   Runtime   Elapsed  TotalMem
      ------------------------------------------------------------------------
         blackscholes           0.0%     -9.5%     -9.0%    -15.0%     -2.2%
                coins           0.0%     -4.7%     -3.6%     -0.6%    -13.6%
               mandel           0.0%     -0.3%     +7.7%    +13.1%     +0.1%
              matmult           0.0%     +1.5%    +10.0%     +7.7%     +0.1%
                nbody           0.0%     -4.1%     -2.9%     0.085      0.0%
               parfib           0.0%     -1.4%     +1.0%     +1.5%     +0.2%
              partree           0.0%     -0.3%     +0.8%     +2.9%     -0.8%
                 prsa           0.0%     -0.5%     -2.1%     -7.6%      0.0%
               queens           0.0%     -3.2%     -1.4%     +2.2%     +1.3%
                  ray           0.0%     -5.6%    -14.5%     -7.6%     +0.8%
             sumeuler           0.0%     -0.4%     +2.4%     +1.1%      0.0%
      ------------------------------------------------------------------------
                  Min           0.0%     -9.5%    -14.5%    -15.0%    -13.6%
                  Max           0.0%     +1.5%    +10.0%    +13.1%     +1.3%
       Geometric Mean          +0.0%     -2.6%     -1.3%     -0.5%     -1.4%
      ```
      
      Not conclusive, but slightly better.  This matters a lot more when you
      have more cores.
      
      Test Plan: validate, nofib/paralel
      
      Reviewers: niteria, ezyang, nh2, trofi, austin, erikd, bgamari
      
      Reviewed By: bgamari
      
      Subscribers: thomie
      
      Differential Revision: https://phabricator.haskell.org/D2581
      
      GHC Trac Issues: #9221
      85e81a85
    • Simon Marlow's avatar
      Default +RTS -qn to the number of cores · 6c47f2ef
      Simon Marlow authored
      Setting a -N value that is too large has a dramatic negative effect on
      performance, but the new -qn flag can mitigate the worst of the effects
      by limiting the number of GC threads.
      
      So now, if you don't explcitly set +RTS -qn, and you set -N larger than
      the number of cores (or use setNumCapabilities to do the same), we'll
      default -qn to the number of cores.
      
      These are the results from nofib/parallel on my 4-core (2 cores x 2
      threads) i7 laptop, comparing -N8 before and after this change.
      
      ```
      ------------------------------------------------------------------------
              Program           Size    Allocs   Runtime   Elapsed  TotalMem
      ------------------------------------------------------------------------
         blackscholes          +0.0%     +0.0%    -72.5%    -72.0%     +9.5%
                coins          +0.0%     -0.0%    -73.7%    -72.2%     -0.8%
               mandel          +0.0%     +0.0%    -76.4%    -75.4%     +3.3%
              matmult          +0.0%    +15.5%    -26.8%    -33.4%     +1.0%
                nbody          +0.0%     +2.4%     +0.7%     0.076      0.0%
               parfib          +0.0%     -8.5%    -33.2%    -31.5%     +2.0%
              partree          +0.0%     -0.0%    -60.4%    -56.8%     +5.7%
                 prsa          +0.0%     -0.0%    -65.4%    -60.4%      0.0%
               queens          +0.0%     +0.2%    -58.8%    -58.8%     -1.5%
                  ray          +0.0%     -1.5%    -88.7%    -85.6%     -3.6%
             sumeuler          +0.0%     -0.0%    -47.8%    -46.9%      0.0%
      ------------------------------------------------------------------------
                  Min          +0.0%     -8.5%    -88.7%    -85.6%     -3.6%
                  Max          +0.0%    +15.5%     +0.7%    -31.5%     +9.5%
       Geometric Mean          +0.0%     +0.6%    -61.4%    -63.1%     +1.4%
      ```
      
      Test Plan: validate, nofib/parallel benchmarks
      
      Reviewers: niteria, ezyang, nh2, austin, erikd, trofi, bgamari
      
      Reviewed By: trofi, bgamari
      
      Subscribers: thomie
      
      Differential Revision: https://phabricator.haskell.org/D2580
      
      GHC Trac Issues: #9221
      6c47f2ef
  13. 31 Aug, 2016 1 commit
    • Simon Marlow's avatar
      Bump the default allocation area size to 1MB · d790cb9d
      Simon Marlow authored
      This is long overdue.
      
      Perhaps 1MB is a little on the skinny size, but this is based on
      * A lot of commodity dual-core desktop processors have 3MB L3 cache
      * We're traditionally quite frugal with memory by default
      
      Test Plan: validate
      
      Reviewers: erikd, bgamari, hvr, austin, rwbarton, ezyang
      
      Reviewed By: ezyang
      
      Subscribers: thomie
      
      Differential Revision: https://phabricator.haskell.org/D2496
      d790cb9d
  14. 30 Aug, 2016 1 commit
    • Sergei Trofimovich's avatar
      rts: enable parallel GC scan of large (32M+) allocation area · a5d26f26
      Sergei Trofimovich authored
      Parallel GC does not scan large allocation area (-A)
      effectively as it does not do work stealing from nursery
      by default.
      
      That leads to large imbalance when only one of threads
      overflows allocation area: most of GC threads finish
      quickly (as there is not much to collect) and sit idle
      waiting while single GC thread finishes scan of single
      allocation area for that thread.
      
      The patch enables work stealing for (equivalent of -qb0)
      allocation area of -A32M or higher.
      
      Tested on a highlighting-kate package from Trac #9221
      
      On 8-core machine the difference is around 5% faster
      of wall-clock time. On 24-core VM the speedup is 20%.
      Signed-off-by: default avatarSergei Trofimovich <siarheit@google.com>
      
      Test Plan: measured wall time and GC parallelism on highlighting-kate build
      
      Reviewers: austin, bgamari, erikd, simonmar
      
      Reviewed By: bgamari, simonmar
      
      Subscribers: thomie
      
      Differential Revision: https://phabricator.haskell.org/D2483
      
      GHC Trac Issues: #9221
      a5d26f26
  15. 10 Jun, 2016 1 commit
    • Simon Marlow's avatar
      NUMA support · 9e5ea67e
      Simon Marlow authored
      Summary:
      The aim here is to reduce the number of remote memory accesses on
      systems with a NUMA memory architecture, typically multi-socket servers.
      
      Linux provides a NUMA API for doing two things:
      * Allocating memory local to a particular node
      * Binding a thread to a particular node
      
      When given the +RTS --numa flag, the runtime will
      * Determine the number of NUMA nodes (N) by querying the OS
      * Assign capabilities to nodes, so cap C is on node C%N
      * Bind worker threads on a capability to the correct node
      * Keep a separate free lists in the block layer for each node
      * Allocate the nursery for a capability from node-local memory
      * Allocate blocks in the GC from node-local memory
      
      For example, using nofib/parallel/queens on a 24-core 2-socket machine:
      
      ```
      $ ./Main 15 +RTS -N24 -s -A64m
        Total   time  173.960s  (  7.467s elapsed)
      
      $ ./Main 15 +RTS -N24 -s -A64m --numa
        Total   time  150.836s  (  6.423s elapsed)
      ```
      
      The biggest win here is expected to be allocating from node-local
      memory, so that means programs using a large -A value (as here).
      
      According to perf, on this program the number of remote memory accesses
      were reduced by more than 50% by using `--numa`.
      
      Test Plan:
      * validate
      * There's a new flag --debug-numa=<n> that pretends to do NUMA without
        actually making the OS calls, which is useful for testing the code
        on non-NUMA systems.
      * TODO: I need to add some unit tests
      
      Reviewers: erikd, austin, rwbarton, ezyang, bgamari, hvr, niteria
      
      Subscribers: thomie
      
      Differential Revision: https://phabricator.haskell.org/D2199
      9e5ea67e
  16. 04 May, 2016 2 commits
    • Simon Marlow's avatar
      Add +RTS -AL<size> · f703fd6b
      Simon Marlow authored
      +RTS -AL<size> controls the total size of large objects that can be
      allocated before a GC is triggered.  Previously this was always just the
      value of -A, and the limit mainly existed to prevent runaway allocation
      in pathalogical programs that allocate a lot of large objects.  However,
      since the limit is shared between all cores, on a large multicore the
      default becomes more restrictive, and can end up triggering GC well
      before it would normally have been.
      
      Arguably a better default would be A*N, but this is probably excessive.
      Adding a flag lets you choose, and I've left the default as it was.
      
      See docs for usage.
      f703fd6b
    • Simon Marlow's avatar
      Allow limiting the number of GC threads (+RTS -qn<n>) · 76ee2607
      Simon Marlow authored
      This allows the GC to use fewer threads than the number of capabilities.
      At each GC, we choose some of the capabilities to be "idle", which means
      that the thread running on that capability (if any) will sleep for the
      duration of the GC, and the other threads will do its work.  We choose
      capabilities that are already idle (if any) to be the idle capabilities.
      
      The idea is that this helps in the following situation:
      
      * We want to use a large -N value so as to make use of hyperthreaded
        cores
      * We use a large heap size, so GC is infrequent
      * But we don't want to use all -N threads in the GC, because that
        thrashes the memory too much.
      
      See docs for usage.
      76ee2607
  17. 23 Jan, 2016 1 commit
  18. 09 Jan, 2016 1 commit
  19. 03 Oct, 2015 1 commit