      Always run explicitly requested ways (extra_ways) for fast runs. · b89c4913
      Edward Z. Yang authored
      To keep validate runs fast, we only run one way.  But I think it's
      important to run some tests a few ways, just to make sure that
      functionality such as the profiler is working.  This commit changes
      the logic so that any way specified in extra_ways is always run for
      fast.  The big change is that profiling tests are now run on fast
      validates as well.  I also made it so that the G1 garbage collector
      tests only run on slow.
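      A minimal Haskell sketch of the selection rule described above. The
      testsuite driver itself is not written in Haskell, so `Speed`, `Way`
      and `waysToRun` are hypothetical names, and the assumption that a fast
      run keeps only the first configured way is a simplification.

```haskell
import Data.List (nub)

-- Hypothetical model of the testsuite's way selection (names are illustrative).
data Speed = Fast | Slow
  deriving (Eq, Show)

type Way = String

-- | Ways to run for one test.
-- Assumption: a fast run keeps only the first configured way and a slow run
-- keeps all of them; explicitly requested extra ways are always added.
waysToRun :: Speed -> [Way] -> [Way] -> [Way]
waysToRun speed configured extraWays = nub (base ++ extraWays)
  where
    base = case speed of
      Fast -> take 1 configured
      Slow -> configured
```

      For example, `waysToRun Fast ["normal", "optasm"] ["prof"]` yields
      `["normal", "prof"]`, so a test that asks for profiling is exercised
      even in a fast run.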
      Signed-off-by: Edward Z. Yang <ezyang@cs.stanford.edu>
      Test Plan: validate
      Reviewers: austin, thomie, bgamari
      Reviewed By: austin, thomie, bgamari
      Subscribers: thomie
      Differential Revision: https://phabricator.haskell.org/D1251
      Be aware of overlapping global STG registers in CmmSink (#10521) · a2f828a3
      rwbarton authored
      On x86_64, commit e2f6bbd3 assigned
      the STG registers F1 and D1 the same hardware register (xmm1), and
      the same for the registers F2 and D2, etc. When mixing calls to
      functions involving Float#s and Double#s, this can lead to incorrect Cmm
      optimizations that assume the F1 and D1 registers are independent.
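      A simplified Haskell model of the overlap, assuming the x86_64 mapping
      described above (Fn and Dn share xmmn). `GlobalReg`, `HwReg`,
      `regsOverlap` and `readCanMovePast` are hypothetical stand-ins, not
      GHC's own definitions; the point is only that a sinking check must
      compare hardware locations rather than register names.

```haskell
-- Hypothetical, simplified model; not GHC's actual GlobalReg type.
data GlobalReg = FloatReg Int | DoubleReg Int | VanillaReg Int
  deriving (Eq, Show)

-- Abstract hardware locations: each vanilla register gets its own
-- general-purpose register, while Fn and Dn share xmm n.
data HwReg = Xmm Int | Gpr Int
  deriving (Eq, Show)

hwReg :: GlobalReg -> HwReg
hwReg (FloatReg n) = Xmm n
hwReg (DoubleReg n) = Xmm n
hwReg (VanillaReg n) = Gpr n

-- | Two global registers conflict when they live in the same hardware register.
-- Testing `a == b` instead is exactly the faulty independence assumption.
regsOverlap :: GlobalReg -> GlobalReg -> Bool
regsOverlap a b = hwReg a == hwReg b

-- | May a read of `r` be moved past code that writes the registers in `written`?
readCanMovePast :: GlobalReg -> [GlobalReg] -> Bool
readCanMovePast r written = not (any (regsOverlap r) written)
```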
      Reviewers: simonpj, austin
      Reviewed By: austin
      Subscribers: simonpj, thomie, bgamari
      Differential Revision: https://phabricator.haskell.org/D993
      GHC Trac Issues: #10521
      Test case for #10246 · 8f070924
      Joachim Breitner authored
      The test is still marked known_broken. This also adds the test case
      for #10245, which should pass once #10246 is fixed.
      Refactor the story around switches (#10137) · de1160be
      Joachim Breitner authored
      This re-implements the code generation for case expressions at the Stg →
      Cmm level, both for data type cases as well as for integral literal
      cases. (Cases on float are still treated as before).
      The goal is to allow for fancier strategies in implementing them, for a
      cleaner separation of the strategy from the gritty details of Cmm, and
      to run this later than the Common Block Optimization, allowing for one
      way to attack #10124. The new module CmmSwitch contains a number of
      notes explaining these changes. Among other things, the new code
      creates larger consecutive jump tables than the previous code, where
      possible.
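      To illustrate the kind of strategy that can now be kept separate from
      the Cmm details, here is a small Haskell sketch that turns an integer
      case into a plan of comparisons and jump tables. It is not the CmmSwitch
      algorithm: `SwitchPlan`, `createPlan`, `runPlan`, the threshold of 4
      cases and the "at least half full" density rule are all assumptions
      chosen for illustration. Dense clusters of alternatives become jump
      tables; sparse ranges are split in the middle.

```haskell
import Data.List (sortOn)
import Data.Maybe (fromMaybe)

type Label = String

-- Hypothetical, simplified switch plan (illustrative only).
data SwitchPlan
  = Unconditionally Label              -- jump to the label
  | IfEqual Integer Label SwitchPlan   -- if x == k then label else plan
  | IfLT Integer SwitchPlan SwitchPlan -- if x < k then plan1 else plan2
  | JumpTable Integer [Label]          -- index the table with x - lo
  deriving (Eq, Show)

-- Assumed threshold below which a chain of equality tests is used.
minJumpTableSize :: Int
minJumpTableSize = 4

-- | Build a plan for distinct case values with a default label.
createPlan :: Label -> [(Integer, Label)] -> SwitchPlan
createPlan dflt = go . sortOn fst
  where
    go [] = Unconditionally dflt
    go cs@((lo, _) : _)
      | length cs < minJumpTableSize =
          foldr (\(k, l) p -> IfEqual k l p) (Unconditionally dflt) cs
      | 2 * toInteger (length cs) >= hi - lo + 1 =
          -- Dense: range-check, then one table covering [lo .. hi].
          IfLT lo (Unconditionally dflt) $
            IfLT (hi + 1)
              (JumpTable lo [fromMaybe dflt (lookup v cs) | v <- [lo .. hi]])
              (Unconditionally dflt)
      | otherwise = IfLT pivot (go left) (go right)
      where
        hi = fst (last cs)
        (left, right) = splitAt (length cs `div` 2) cs
        pivot = fst (head right)

-- | Interpret a plan, e.g. to check it against a plain lookup.
runPlan :: SwitchPlan -> Integer -> Label
runPlan (Unconditionally l) _ = l
runPlan (IfEqual k l p) x = if x == k then l else runPlan p x
runPlan (IfLT k p q) x = if x < k then runPlan p x else runPlan q x
runPlan (JumpTable lo ls) x = ls !! fromInteger (x - lo)
```

      With the case values 1, 2, 3, 5 and 100..103, this sketch produces one
      comparison against 100 and, on each side, a range-checked jump table.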
      nofib shows little overall improvement in runtime. The rather large
      wobble (from -6.1% to +5.6%) comes from changes in the code block order
      (see #8082, not much we can do about it). But the decrease in code size
      alone makes this worthwhile.
             Program     Size   Allocs  Runtime  Elapsed TotalMem
                 Min    -1.8%     0.0%    -6.1%    -6.1%    -2.9%
                 Max    -0.7%    +0.0%    +5.6%    +5.7%    +7.8%
      Geometric Mean    -1.4%    -0.0%    -0.3%    -0.3%    +0.0%
      Compilation time increases slightly:
              -1 s.d.                -----            -2.0%
              +1 s.d.                -----            +2.5%
              Average                -----            +0.3%
      The test case T783 regresses a lot, but it is the only one exhibiting
      any regression. The cause is the changed order of branches in an
      if-then-else tree, which makes the Hoopl data flow analysis traverse
      the blocks in a suboptimal order. Reverting that change removes this
      regression, but has a consistent, if very small (+0.2%), negative
      effect on runtime. So I conclude that this test is an extreme outlier
      and no reason to change the code.
      Differential Revision: https://phabricator.haskell.org/D720
      Changing prefetch primops to have a `seq`-like interface · f44333ea
      Carter Schonwald authored
      The current primops for prefetching do not work properly in pure code;
      namely, the primops are not 'hoisted' into the correct call sites based
      on when arguments are evaluated. Instead, they should use a `seq`-like
      interface, which causes the prefetch to happen when the term it is
      attached to is evaluated.
      See #9353 for the full discussion.
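      A minimal usage sketch, assuming the prefetch primops as exposed by
      `GHC.Exts` in current GHC, where `prefetchValue0# :: a -> State# s ->
      State# s` threads the state token; whether this is exactly the signature
      introduced by this commit is an assumption. The wrapper name
      `prefetchValue` is hypothetical. Threading the state token places the
      prefetch at a definite point in the IO sequence.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
import GHC.Exts (prefetchValue0#)
import GHC.IO (IO (..))

-- | Hint the CPU to prefetch the closure of a value (locality hint 0).
-- Hypothetical helper; the primop is only a cache hint.
prefetchValue :: a -> IO ()
prefetchValue x = IO (\s -> (# prefetchValue0# x s, () #))
```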
      Test Plan: updated tests for pure prefetch in T8256 to reflect the design changes in #9353
      Reviewers: simonmar, hvr, ekmett, austin
      Reviewed By: ekmett, austin
      Subscribers: merijn, thomie, carter, simonmar
      Differential Revision: https://phabricator.haskell.org/D350
      GHC Trac Issues: #9353
      Make clearNursery free · e22bc0de
      Simon Marlow authored
      clearNursery resets all the bd->free pointers of nursery blocks to
      make the blocks empty.  In profiles we've seen clearNursery taking
      a significant amount of time, particularly with large -N and -A values.
      This patch moves the work of clearNursery to the point at which we
      actually need the new block, thereby introducing an invariant that
      blocks to the right of the CurrentNursery pointer still need their
      bd->free pointer reset.  This should make things faster overall,
      because we don't need to clear blocks that we don't use.
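      The new invariant can be modelled in a few lines of Haskell. This is a
      hypothetical pure model, not RTS code: `Block`, `Nursery`, `nextBlock`
      and `rewind` are illustrative names, and a block is reduced to a start
      address and a free pointer. Only the block that becomes current gets its
      free pointer reset; blocks to its right may hold stale pointers until
      they are reached.

```haskell
-- Hypothetical model of a nursery block: start address and free pointer.
data Block = Block { blockStart :: Int, blockFree :: Int }
  deriving (Eq, Show)

-- Blocks left of the current one are used up; blocks to the right
-- (pending) may still carry a stale free pointer from the last cycle.
data Nursery = Nursery
  { usedBlocks   :: [Block] -- most recently used first
  , currentBlock :: Block
  , pending      :: [Block]
  } deriving Show

-- | Make a block empty by moving its free pointer back to its start.
resetBlock :: Block -> Block
resetBlock b = b { blockFree = blockStart b }

-- | Move to the next block, resetting it only now that it is needed.
-- Returns Nothing when the nursery is exhausted (time for a GC).
nextBlock :: Nursery -> Maybe Nursery
nextBlock (Nursery used cur rest) = case rest of
  [] -> Nothing
  b : bs -> Just (Nursery (cur : used) (resetBlock b) bs)

-- | After a GC, point back at the first block. Instead of clearing every
-- block up front, only the new current block is reset.
rewind :: Nursery -> Nursery
rewind (Nursery used cur rest) = case reverse used of
  [] -> Nursery [] (resetBlock cur) rest
  b : bs -> Nursery [] (resetBlock b) (bs ++ cur : rest)
```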
      Test Plan: validate
      Reviewers: AndreasVoellmy, ezyang, austin
      Subscribers: thomie, carter, ezyang, simonmar
      Differential Revision: https://phabricator.haskell.org/D318
      Implement new CLZ and CTZ primops (re #9340) · e0c1767d
      Herbert Valerio Riedel authored
      This implements the new primops
        clz#, clz32#, clz64#,
        ctz#, ctz32#, ctz64#
      which provide efficient implementations of the popular
      count-leading-zeros and count-trailing-zeros operations, respectively
      (see the test case for a pure Haskell reference implementation).
      On x86, both the NCG and LLVM generate code based on the BSF/BSR
      instructions (which need extra logic to make the 0-case well-defined).
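      A pure Haskell reference for the two operations, in the spirit of the
      test case mentioned above (this is not that test's code). The reference
      returns the bit width for 0, which matches `Data.Bits`. It is
      cross-checked against `countLeadingZeros`/`countTrailingZeros` from
      `Data.Bits`; `clzRef`, `ctzRef` and `agreesWithBase` are hypothetical
      names.

```haskell
import Data.Bits (FiniteBits, countLeadingZeros, countTrailingZeros, finiteBitSize, testBit)
import Data.Word (Word64)

-- | Count leading zeros by scanning from the most significant bit down.
-- For 0 every bit is scanned, so the result is the bit width.
clzRef :: FiniteBits b => b -> Int
clzRef x = length (takeWhile (not . testBit x) [n - 1, n - 2 .. 0])
  where n = finiteBitSize x

-- | Count trailing zeros by scanning from the least significant bit up.
ctzRef :: FiniteBits b => b -> Int
ctzRef x = length (takeWhile (not . testBit x) [0 .. n - 1])
  where n = finiteBitSize x

-- | Cross-check the reference against Data.Bits for one 64-bit value.
agreesWithBase :: Word64 -> Bool
agreesWithBase w = clzRef w == countLeadingZeros w && ctzRef w == countTrailingZeros w
```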
      Test Plan: validate and successful tests on i686 and amd64
      Reviewers: rwbarton, simonmar, ezyang, austin
      Subscribers: simonmar, relrod, ezyang, carter
      Differential Revision: https://phabricator.haskell.org/D144
      GHC Trac Issues: #9340
      CopySmallArrayStressTest needs random · c3108234
      Joachim Breitner authored
      Add SmallArray# and SmallMutableArray# types · 90329b6c
      tibbe authored
      These array types are smaller than Array# and MutableArray# and are
      faster when the array size is small, as they don't have the overhead
      of a card table. Having no card table reduces the closure size by 2
      words in the typical small-array case and leads to less work when
      updating or GCing the array.
      This reduces both the runtime and memory allocation by 8.8% on my insert
      benchmark for the HashMap type in the unordered-containers package,
      which makes use of lots of small arrays. With tuned GC settings
      (i.e. `+RTS -A6M`) the runtime reduction is 15%.
      Fixes #8923.
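      A minimal sketch of building and reading a `SmallArray#` with the
      primops exposed by `GHC.Exts` (`newSmallArray#`, `writeSmallArray#`,
      `unsafeFreezeSmallArray#`, `indexSmallArray#`, `sizeofSmallArray#`).
      The boxed wrapper `SmallArr` and the helpers `fromListSA`, `indexSA`
      and `sizeSA` are hypothetical, and they do no bounds checking.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
import Control.Monad.ST (runST)
import GHC.Exts
  ( Int (I#), SmallArray#, indexSmallArray#, newSmallArray#
  , sizeofSmallArray#, unsafeFreezeSmallArray#, writeSmallArray#, (+#) )
import GHC.ST (ST (..))

-- | Boxed wrapper so the unlifted SmallArray# can be stored and passed around.
data SmallArr a = SmallArr (SmallArray# a)

-- | Copy a list into a fresh SmallArray# (hypothetical helper).
fromListSA :: [a] -> SmallArr a
fromListSA xs = case length xs of
  I# n -> runST (ST (\s0 ->
    case newSmallArray# n (error "fromListSA: uninitialised") s0 of
      (# s1, marr #) ->
        let -- Write each element at successive indices, threading the state.
            fill _ [] s = s
            fill i (y : ys) s = fill (i +# 1#) ys (writeSmallArray# marr i y s)
        in case unsafeFreezeSmallArray# marr (fill 0# xs s1) of
             (# s2, arr #) -> (# s2, SmallArr arr #)))

-- | Read an element (no bounds check).
indexSA :: SmallArr a -> Int -> a
indexSA (SmallArr arr) (I# i) = case indexSmallArray# arr i of (# x #) -> x

-- | Number of elements in the array.
sizeSA :: SmallArr a -> Int
sizeSA (SmallArr arr) = I# (sizeofSmallArray# arr)
```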
