  1. Jan 08, 2024
  2. Aug 09, 2023
  3. Apr 26, 2023
    • DmdAnal: Unleash demand signatures of free RULE and unfolding binders (#23208) · c30ac25f
      Sebastian Graf authored and Marge Bot committed
      In #23208 we observed that the demand signature of a binder occurring in a RULE
      wasn't unleashed, leading to a transitively used binder being discarded as
      absent. The solution was to use the same code path that we already use for
      handling exported bindings.
      
      See the changes to `Note [Absence analysis for stable unfoldings and RULES]`
      for more details.
      
      I took the chance to factor out the old notion of a `PlusDmdArg` (a pair of a
      `VarEnv Demand` and a `Divergence`) into `DmdEnv`, which fits nicely into our
      existing framework. As a result, I had to touch quite a few places in the code.
      
      This refactoring exposed a few small bugs around correct handling of bottoming
      demand environments. As a result, some strictness signatures now mention uniques
      that weren't there before which caused test output changes to T13143, T19969 and
      T22112. But these tests compared whole -ddump-simpl listings which is a very
      fragile thing to begin with. I changed what exactly they test for based on the
      symptoms in the corresponding issues.
      
      There is a single regression in T18894 because we are more conservative around
      stable unfoldings now. Unfortunately it is not easily fixed; let's wait until
      there is a concrete motivation before investing more time.
      
      Fixes #23208.
  4. Jan 11, 2023
    • Fix void-arg-adding mechanism for worker/wrapper · 964284fc
      Simon Peyton Jones authored and Marge Bot committed
      As #22725 shows, in worker/wrapper we must add the void argument
      /last/, not first.  See GHC.Core.Opt.WorkWrap.Utils
      Note [Worker/wrapper needs to add void arg last].
      
      That led me to study GHC.Core.Opt.SpecConstr
      Note [SpecConstr needs to add void args first] which suggests the
      opposite!  And indeed I think it's the other way round for SpecConstr
      -- or more precisely the void arg must precede the "extra_bndrs".
      
      That led me to some refactoring of GHC.Core.Opt.SpecConstr.calcSpecInfo.
  5. Sep 28, 2022
    • Refactor UnfoldingSource and IfaceUnfolding · addeefc0
      Simon Peyton Jones authored and Marge Bot committed
      I finally got tired of the way that IfaceUnfolding reflected
      a previous structure of unfoldings, not the current one. This
      MR refactors UnfoldingSource and IfaceUnfolding to be simpler
      and more consistent.
      
      It's largely just a refactor, but in UnfoldingSource (which moves
      to GHC.Types.Basic, since it is now used in IfaceSyn too), I
      distinguish between /user-specified/ and /system-generated/ stable
      unfoldings.
      
          data UnfoldingSource
            = VanillaSrc
            | StableUserSrc   -- From a user-specified pragma
            | StableSystemSrc -- From a system-generated unfolding
            | CompulsorySrc
      
      This has a minor effect in CSE (see the use of isStableUserUnfolding
      in GHC.Core.Opt.CSE), which I tripped over when working on
      specialisation, but it seems like a Good Thing to know anyway.
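
      As a rough, standalone illustration (not the code in GHC itself), the
      user/system distinction can be queried with small predicates over the
      type quoted above. The names `isStableSourceSketch` and
      `isStableUserSourceSketch` are hypothetical and chosen for this sketch only.

      ```hs
      -- The constructors mirror the declaration quoted in the commit message.
      data UnfoldingSource
        = VanillaSrc
        | StableUserSrc   -- From a user-specified pragma
        | StableSystemSrc -- From a system-generated unfolding
        | CompulsorySrc
        deriving (Eq, Show)

      -- Hypothetical helper: everything except a vanilla unfolding is stable.
      isStableSourceSketch :: UnfoldingSource -> Bool
      isStableSourceSketch VanillaSrc = False
      isStableSourceSketch _ = True

      -- Hypothetical helper: only unfoldings that come from a user pragma.
      isStableUserSourceSketch :: UnfoldingSource -> Bool
      isStableUserSourceSketch StableUserSrc = True
      isStableUserSourceSketch _ = False

      main :: IO ()
      main = mapM_ report [VanillaSrc, StableUserSrc, StableSystemSrc, CompulsorySrc]
        where
          report src =
            putStrLn (show src ++ ": stable=" ++ show (isStableSourceSketch src)
                      ++ ", user=" ++ show (isStableUserSourceSketch src))
      ```

      A pass such as CSE could then treat user-written stable unfoldings more
      conservatively than system-generated ones by branching on the second
      predicate.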
  6. Jun 27, 2022
    • Don't mark lambda binders as OtherCon · ac7a7fc8
      Andreas Klebinger authored and Marge Bot committed
      We used to put OtherCon unfoldings on lambda binders of workers
      and sometimes also join points/specializations with the
      assumption that since the wrapper would force these arguments
      once we execute the RHS they would indeed be in WHNF.
      
      This was wrong for reasons detailed in #21472. So now we purge
      evaluated unfoldings from *all* lambda binders.
      
      This fixes #21472, but at the cost of sometimes not using as efficient a
      calling convention. It can also change inlining behaviour as some
      occurrences will no longer look like value arguments when they did
      before.
      
      As a consequence we also change how we compute CBV information for
      arguments slightly. We now *always* determine the CBV convention
      for arguments during tidy. Earlier in the pipeline we merely mark
      functions as candidates for having their arguments treated as CBV.
      
      As before the process is described in the relevant notes:
      Note [CBV Function Ids]
      Note [Attaching CBV Marks to ids]
      Note [Never put `OtherCon` unfoldings on lambda binders]
      
      -------------------------
      Metric Decrease:
          T12425
          T13035
          T18223
          T18923
          MultiLayerModulesTH_OneShot
      Metric Increase:
          WWRec
      -------------------------
  7. May 05, 2022
  8. Mar 16, 2022
    • Demand: Let `Boxed` win in `lubBoxity` (#21119) · 1575c4a5
      Sebastian Graf authored and Marge Bot committed
      Previously, we let `Unboxed` win in `lubBoxity`, which is unsoundly optimistic
      in terms of Boxity analysis. "Unsoundly" in the sense that we sometimes unbox
      parameters that we had better not unbox. Examples are #18907 and T19871.absent.
      
      Until now, we thought that this hack pulled its weight because it worked around
      some shortcomings of the phase separation between Boxity analysis and CPR
      analysis. But it is a gross hack which caused regressions itself that needed all
      kinds of fixes and workarounds. See for example #20767. It became impossible to
      work with in !7599, so I want to remove it.
      
      For example, at the moment, `lubDmd B dmd` will not unbox `dmd`,
      but `lubDmd A dmd` will. Given that `B` is supposed to be the bottom element of
      the lattice, it's hardly justifiable to get a better demand when `lub`bing with
      `A`.
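
      As a rough illustration (a toy model, not the definitions in GHC), the
      two choices for the join can be written over a two-point type. The names
      `lubBoxityOld` and `lubBoxityNew` are hypothetical; only the "which side
      wins" behaviour is taken from the description above.

      ```hs
      data Boxity = Boxed | Unboxed
        deriving (Eq, Show)

      -- Old, optimistic join: `Unboxed` wins whenever either side is `Unboxed`.
      lubBoxityOld :: Boxity -> Boxity -> Boxity
      lubBoxityOld Boxed Boxed = Boxed
      lubBoxityOld _ _ = Unboxed

      -- New, conservative join: `Boxed` wins whenever either side is `Boxed`.
      lubBoxityNew :: Boxity -> Boxity -> Boxity
      lubBoxityNew Unboxed Unboxed = Unboxed
      lubBoxityNew _ _ = Boxed

      main :: IO ()
      main = do
        putStrLn ("old: " ++ show (lubBoxityOld Boxed Unboxed))
        putStrLn ("new: " ++ show (lubBoxityNew Boxed Unboxed))
      ```

      With the conservative join, one code path that needs the value boxed is
      enough to keep it boxed after the `lub`, which is what guards against
      reboxing.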
      
      The consequence of letting `Boxed` win in `lubBoxity` is that we *would* regress
       #2387, #16040 and parts of #5075 and T19871.sumIO, until Boxity and CPR
      are able to communicate better. Fortunately, that is not the case since I could
      tweak the other source of optimism in Boxity analysis that is described in
      `Note [Unboxed demand on function bodies returning small products]` so that
      we *recursively* assume unboxed demands on function bodies returning small
      products. See the updated Note.
      
      `Note [Boxity for bottoming functions]` describes why we need bottoming
      functions to have signatures that say that they deeply unbox their arguments.
      In so doing, I had to tweak `finaliseArgBoxities` so that it will never unbox
      recursive data constructors. This is in line with our handling of them in CPR.
      I updated `Note [Which types are unboxed?]` to reflect that.
      
      In turn we fix #21119, #20767, #18907, T19871.absent and get a much simpler
      implementation (at least to think about). We can also drop the very ad-hoc
      definition of `deferAfterPreciseException` and its Note in favor of the
      simple, intuitive definition we used to have.
      
      Metric Decrease:
          T16875
          T18223
          T18698a
          T18698b
          hard_hole_fits
      Metric Increase:
          LargeRecord
          MultiComponentModulesRecomp
          T15703
          T8095
          T9872d
      
      Out of all the regressions, only the one in T9872d doesn't vanish in a perf
      build, where the compiler is bootstrapped with -O2 and thus SpecConstr.
      Reason for regressions:
      
        * T9872d is due to `ty_co_subst` taking its `LiftingContext` boxed.
          That is because the context is passed to a function argument, for
          example in `liftCoSubstTyVarBndrUsing`.
        * In T15703, LargeRecord and T8095, we get a bit more allocations in
          `expand_syn` and `piResultTys`, because a `TCvSubst` isn't unboxed.
          In both cases that guards against reboxing in some code paths.
        * The same is true for MultiComponentModulesRecomp, where we get less unboxing
          in `GHC.Unit.Finder.$wfindInstalledHomeModule`. In a perf build, allocations
          actually *improve* by over 4%!
      
      Results on NoFib:
      
      --------------------------------------------------------------------------------
              Program         Allocs    Instrs
      --------------------------------------------------------------------------------
               awards          -0.4%     +0.3%
            cacheprof          -0.3%     +2.4%
                  fft          -1.5%     -5.1%
             fibheaps          +1.2%     +0.8%
                fluid          -0.3%     -0.1%
                  ida          +0.4%     +0.9%
         k-nucleotide          +0.4%     -0.1%
           last-piece         +10.5%    +13.9%
                 lift          -4.4%     +3.5%
              mandel2         -99.7%    -99.8%
                 mate          -0.4%     +3.6%
               parser          -1.0%     +0.1%
               puzzle         -11.6%     +6.5%
      reverse-complem          -3.0%     +2.0%
                  scs          -0.5%     +0.1%
               sphere          -0.4%     -0.2%
            wave4main          -8.2%     -0.3%
      --------------------------------------------------------------------------------
      Summary excludes mandel2 because of excessive bias
                  Min         -11.6%     -5.1%
                  Max         +10.5%    +13.9%
       Geometric Mean          -0.2%     +0.3%
      --------------------------------------------------------------------------------
      
      Not bad for a bug fix.
      
      The regression in `last-piece` could become a win if SpecConstr would work on
      non-recursive functions. The regression in `fibheaps` is due to
      `Note [Reboxed crud for bottoming calls]`, e.g., #21128.
  9. Feb 12, 2022
    • Tag inference work. · 0e93023e
      Andreas Klebinger authored and Matthew Pickering committed
      This does three major things:
      * Enforce the invariant that all strict fields must contain tagged
      pointers.
      * Try to predict the tag on bindings in order to omit tag checks.
      * Allow functions to pass arguments unlifted (call-by-value).
      
      The first is "simply" achieved by wrapping any constructor allocations with
      a case which will evaluate the respective strict bindings.
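
      A source-level analogy may help here. The real transformation happens on
      STG and concerns pointer tags, which plain Haskell cannot express, so the
      following is only a sketch under that caveat: `mkTExplicit` (a
      hypothetical name) evaluates the field before allocating the constructor,
      which is the shape of the wrapping `case`.

      ```hs
      import Control.Exception (ErrorCall, evaluate, try)

      -- A constructor with a strict field.
      data T = MkT !Int

      -- Evaluate the argument first, then allocate the constructor.
      -- `seq` plays the role of the wrapping `case` in this analogy.
      mkTExplicit :: Int -> T
      mkTExplicit x = x `seq` MkT x

      main :: IO ()
      main = do
        -- Forcing the constructor to WHNF forces the strict field as well.
        r <- try (evaluate (mkTExplicit (error "forced"))) :: IO (Either ErrorCall T)
        putStrLn (either (const "field was evaluated before allocation")
                         (const "field was left unevaluated") r)
      ```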
      
      The prediction is done by a new data flow analysis based on the STG
      representation of a program. This also helps us to avoid generating
      redundant cases for the above invariant.
      
      StrictWorkers are created by W/W directly and SpecConstr indirectly.
      See the Note [Strict Worker Ids]
      
      Other minor changes:
      
      * Add StgUtil module containing a few functions needed by, but
        not specific to the tag analysis.
      
      -------------------------
      Metric Decrease:
          T12545
          T18698b
          T18140
          T18923
          LargeRecord
      Metric Increase:
          LargeRecord
          ManyAlternatives
          ManyConstructors
          T10421
          T12425
          T12707
          T13035
          T13056
          T13253
          T13253-spj
          T13379
          T15164
          T18282
          T18304
          T18698a
          T1969
          T20049
          T3294
          T4801
          T5321FD
          T5321Fun
          T783
          T9233
          T9675
          T9961
          T19695
          WWRec
      -------------------------
  10. Oct 24, 2021
    • DmdAnal: Implement Boxity Analysis (#19871) · 3bab222c
      Sebastian Graf authored and Marge Bot committed
      This patch fixes some abundant reboxing of `DynFlags` in
      `GHC.HsToCore.Match.Literal.warnAboutOverflowedLit` (which was the topic
      of #19407) by introducing a Boxity analysis to GHC, done as part of demand
      analysis. This allows us to capture the ad-hoc unboxing decisions previously
      made in worker/wrapper accurately in demand analysis, where the boxity info
      can propagate through demand signatures.
      
      See the new `Note [Boxity analysis]`. The actual fix for #19407 is described in
      `Note [No lazy, Unboxed demand in demand signature]`, but
      `Note [Finalising boxity for demand signature]` is probably a better entry-point.
      
      To support the fix for #19407, I had to change (what was)
      `Note [Add demands for strict constructors]` a bit
      (now `Note [Unboxing evaluated arguments]`). In particular, we now take care of
      it in `finaliseBoxity` (which is only called from demand analysis) instead of
      `wantToUnboxArg`.
      
      I also had to resurrect `Note [Product demands for function body]` and rename
      it to `Note [Unboxed demand on function bodies returning small products]` to
      avoid huge regressions in `join004` and `join007`, thereby fixing #4267 again.
      See the updated Note for details.
      
      A nice side-effect is that the worker/wrapper transformation no longer needs to
      look at strictness info and other bits such as `InsideInlineableFun` flags
      (needed for `Note [Do not unbox class dictionaries]`) at all. It simply collects
      boxity info from argument demands and interprets them with a severely simplified
      `wantToUnboxArg`. All the smartness is in `finaliseBoxity`, which could be moved
      to DmdAnal completely, if it weren't for the call to `dubiousDataConInstArgTys`
      which would be awkward to export.
      
      I spent some time figuring out the reason for why `T16197` failed prior to my
      amendments to `Note [Unboxing evaluated arguments]`. After having it figured
      out, I minimised it a bit and added `T16197b`, which simply compares computed
      strictness signatures and thus should be far simpler to eyeball.
      
      The 12% ghc/alloc regression in T11545 is because of the additional `Boxity`
      field in `Poly` and `Prod` that results in more allocation during `lubSubDmd`
      and `plusSubDmd`. I made sure in the ticky profiles that the number of calls
      to those functions stayed the same. We can bear such an increase here, as we
      recently improved it by -68% (in b760c1f7).
      T18698* regress slightly because there is more unboxing of dictionaries
      happening and that causes Lint (mostly) to allocate more.
      
      Fixes #19871, #19407, #4267, #16859, #18907 and #13331.
      
      Metric Increase:
          T11545
          T18698a
          T18698b
      
      Metric Decrease:
          T12425
          T16577
          T18223
          T18282
          T4267
          T9961
  11. Sep 28, 2021
  12. Sep 08, 2021
  13. Jun 27, 2021
    • WorkWrap: Remove mkWWargs (#19874) · eee498bf
      Sebastian Graf authored and Marge Bot committed
      `mkWWargs`'s job was pushing casts inwards and doing eta expansion to match
      the arity with the number of argument demands we w/w for.
      
      Nowadays, we use the Simplifier to eta expand to arity. In fact, in recent years
      we have even seen the eta expansion done by w/w as harmful, see Note [Don't eta
      expand in w/w]. If a function hasn't enough manifest lambdas, don't w/w it!
      
      What purpose does `mkWWargs` serve in this world? Not a great one, it turns out!
      I could remove it by pulling some important bits,
      notably Note [Freshen WW arguments] and Note [Join points and beta-redexes].
      Result: We reuse the freshened binder names of the wrapper in the
      worker where possible (see testsuite changes), much nicer!
      
      In order to avoid scoping errors due to lambda-bound unfoldings in worker
      arguments, we zap those unfoldings now. In doing so, we fix #19766.
      
      Fixes #19874.
  14. Apr 20, 2021
    • Worker/wrapper: Refactor CPR WW to work for nested CPR (#18174) · fdbead70
      Sebastian Graf authored
      In another small step towards bringing a manageable variant of Nested
      CPR into GHC, this patch refactors worker/wrapper to be able to exploit
      Nested CPR signatures. See the new Note [Worker/wrapper for CPR].
      
      The nested code path is currently not triggered, though, because all
      signatures that we annotate are still flat. So purely a refactoring.
      I am very confident that it works, because I ripped it off !1866 95%
      unchanged.
      
      A few test case outputs changed, but only in their auxiliary names.
      I also added test cases for #18109 and #18401.
      
      There's a 2.6% metric increase in T13056 after a rebase, caused by an
      additional Simplifier run. It appears b1d0b9c saw a similar additional
      iteration. I think it's just a fluke.
      
      Metric Increase:
          T13056
  15. Mar 20, 2021
    • Nested CPR light (#19398) · 044e5be3
      Sebastian Graf authored and Marge Bot committed
      While fixing #19232, it became increasingly clear that the vestigial
      hack described in `Note [Optimistic field binder CPR]` is complicated
      and causes reboxing. Rather than make the hack worse, this patch
      gets rid of it completely in favor of giving deeply unboxed parameters
      the Nested CPR property. Example:
      ```hs
      f :: (Int, Int) -> Int
      f p = case p of
       (x, y) | x == y    = x
              | otherwise = y
      ```
      Based on `p`'s `idDemandInfo` `1P(1P(L),1P(L))`, we can see that both
      fields of `p` will be available unboxed. As a result, we give `p` the
      nested CPR property `1(1,1)`. When analysing the `case`, the field
      CPRs are transferred to the binders `x` and `y`, respectively, so that
      we ultimately give `f` the CPR property.
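
      To make the payoff concrete, here is a hand-written sketch of roughly
      the worker/wrapper split that such a CPR signature enables. It is an
      illustration only, not compiler output: the worker name `wf` and the
      pragmas are assumptions made for this example.

      ```hs
      {-# LANGUAGE MagicHash #-}
      import GHC.Exts (Int (I#), Int#, isTrue#, (==#))

      -- Wrapper: unboxes the pair and both fields, then reboxes the result.
      f :: (Int, Int) -> Int
      f (I# x, I# y) = I# (wf x y)
      {-# INLINE f #-}

      -- Worker (hypothetical name): takes and returns unboxed integers,
      -- so no `Int` box is allocated for the result inside the worker.
      wf :: Int# -> Int# -> Int#
      wf x y
        | isTrue# (x ==# y) = x
        | otherwise = y
      {-# NOINLINE wf #-}

      main :: IO ()
      main = print (f (3, 3), f (3, 4))
      ```

      Because the wrapper is small and inlines at call sites, a caller that
      immediately scrutinises the result can cancel the reboxing `I#`.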
      
      I took the liberty to do a bit of refactoring:
      
      - I renamed `CprResult` ("Constructed product result result") to plain
        `Cpr`.
      - I introduced `FlatConCpr` in addition to (now nested) `ConCpr` and
        an according pattern synonym that rewrites flat `ConCpr` to
        `FlatConCpr`s, purely for compiler perf reasons.
      - Similarly for performance reasons, we now store binders with a
        Top signature in a separate `IntSet`,
        see `Note [Efficient Top sigs in SigEnv]`.
      - I moved a bit of stuff around in `GHC.Core.Opt.WorkWrap.Utils` and
        introduced `UnboxingDecision` to replace the `Maybe DataConPatContext`
        type we used to return from `wantToUnbox`.
      - Since the `Outputable Cpr` instance changed anyway, I removed the
        leading `m` which we used to emit for `ConCpr`. It's just noise,
        especially now that we may output nested CPRs.
      
      Fixes #19398.
  16. Mar 03, 2021
  17. Dec 14, 2020
    • Optimise nullary type constructor usage · dad87210
      Ben Gamari authored
      During the compilation of programs GHC very frequently deals with
      the `Type` type, which is a synonym of `TYPE 'LiftedRep`. This patch
      teaches GHC to avoid expanding the `Type` synonym (and other nullary
      type synonyms) during type comparisons, saving a good amount of work.
      This optimisation is described in `Note [Comparing nullary type
      synonyms]`.
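
      The idea can be sketched on a toy type representation. Everything here
      (`Ty`, `Synonyms`, `expand`, `eqTy`) is hypothetical and far simpler
      than GHC's own types; the sketch only shows the fast path of comparing
      two occurrences of the same nullary synonym without expanding either.

      ```hs
      -- A toy type: a type constructor applied to arguments.
      data Ty = TyCon String [Ty]
        deriving (Eq, Show)

      -- Nullary synonyms only: a name and its right-hand side.
      type Synonyms = [(String, Ty)]

      -- Fully expand nullary synonyms (the slow path).
      expand :: Synonyms -> Ty -> Ty
      expand syns (TyCon c args) =
        case (args, lookup c syns) of
          ([], Just rhs) -> expand syns rhs
          _ -> TyCon c (map (expand syns) args)

      -- Equality with a fast path: identical nullary constructors are equal
      -- without looking at the synonym table at all.
      eqTy :: Synonyms -> Ty -> Ty -> Bool
      eqTy _ (TyCon c1 []) (TyCon c2 []) | c1 == c2 = True
      eqTy syns t1 t2 = expand syns t1 == expand syns t2

      main :: IO ()
      main = do
        let syns = [("Type", TyCon "TYPE" [TyCon "LiftedRep" []])]
            ty = TyCon "Type" []
        print (eqTy syns ty ty)
        print (eqTy syns ty (TyCon "TYPE" [TyCon "LiftedRep" []]))
      ```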
      
      To maximize the impact of this optimisation, we introduce a few
      special-cases to reduce `TYPE 'LiftedRep` to `Type`. See
      `Note [Prefer Type over TYPE 'LiftedPtrRep]`.
      
      Closes #17958.
      
      Metric Decrease:
         T18698b
         T1969
         T12227
         T12545
         T12707
         T14683
         T3064
         T5631
         T5642
         T9020
         T9630
         T9872a
         T13035
         haddock.Cabal
         haddock.base
    • Revert "Optimise nullary type constructor usage" · 92377c27
      Ben Gamari authored
      This was inadvertently merged.
      
      This reverts commit 7e9debd4.
    • Optimise nullary type constructor usage · 7e9debd4
      Ben Gamari authored
      During the compilation of programs GHC very frequently deals with
      the `Type` type, which is a synonym of `TYPE 'LiftedRep`. This patch
      teaches GHC to avoid expanding the `Type` synonym (and other nullary
      type synonyms) during type comparisons, saving a good amount of work.
      This optimisation is described in `Note [Comparing nullary type
      synonyms]`.
      
      To maximize the impact of this optimisation, we introduce a few
      special-cases to reduce `TYPE 'LiftedRep` to `Type`. See
      `Note [Prefer Type over TYPE 'LiftedPtrRep]`.
      
      Closes #17958.
      
      Metric Decrease:
         T18698b
         T1969
         T12227
         T12545
         T12707
         T14683
         T3064
         T5631
         T5642
         T9020
         T9630
         T9872a
         T13035
         haddock.Cabal
         haddock.base
  18. Nov 20, 2020
    • Demand: Interleave usage and strictness demands (#18903) · 0aec78b6
      Sebastian Graf authored and Marge Bot committed
      As outlined in #18903, interleaving usage and strictness demands not
      only means a more compact demand representation, but also allows us to
      express demands that we weren't easily able to express before.
      
      Call demands are *relative* in the sense that a call demand `Cn(cd)`
      on `g` says "`g` is called `n` times. *Whenever `g` is called*, the
      result is used according to `cd`". Example from #18903:
      
      ```hs
      h :: Int -> Int
      h m =
        let g :: Int -> (Int,Int)
            g 1 = (m, 0)
            g n = (2 * n, 2 `div` n)
            {-# NOINLINE g #-}
        in case m of
          1 -> 0
          2 -> snd (g m)
          _ -> uncurry (+) (g m)
      ```
      
      Without the interleaved representation, we would just get `L` for the
      strictness demand on `g`. Now we are able to express that whenever
      `g` is called, its second component is used strictly, by giving `g`
      the demand `1C1(P(1P(U),SP(U)))`. This would allow Nested CPR to unbox the
      division, for example.
      
      Fixes #18903.
      While fixing regressions, I also discovered and fixed #18957.
      
      Metric Decrease:
          T13253-spj
  19. Oct 14, 2020
    • Fix some missed opportunities for preInlineUnconditionally · 15d2340c
      Simon Peyton Jones authored and Marge Bot committed
      There are two significant changes here:
      
      * Ticket #18815 showed that we were missing some opportunities for
        preInlineUnconditionally.  The one-line fix is in the code for
        GHC.Core.Opt.Simplify.Utils.preInlineUnconditionally, which now
        switches off only for INLINE pragmas.  I expanded
        Note [Stable unfoldings and preInlineUnconditionally] to explain.
      
      * When doing this I discovered a way in which preInlineUnconditionally
        was occasionally /too/ eager.  It's all explained in
        Note [Occurrences in stable unfoldings] in GHC.Core.Opt.OccurAnal,
        and the one-line change adding markAllMany to occAnalUnfolding.
      
      I also got confused about what NoUserInline meant, so I've renamed
      it to NoUserInlinePrag, and changed its pretty-printing slightly.
      That led to some error message wibbling, and touches quite a few
      files, but there is no change in functionality.
      
      I did a nofib run.  As expected, no significant changes.
      
              Program           Size    Allocs
      ----------------------------------------
               sphere          -0.0%     -0.4%
      ----------------------------------------
                  Min          -0.0%     -0.4%
                  Max          -0.0%     +0.0%
       Geometric Mean          -0.0%     -0.0%
      
      I'm allowing a max-residency increase for T10370, which seems
      very irreproducible. (See comments on !4241.)  There is always
      sampling error for max-residency measurements; and in any case
      the change shows up on some platforms but not others.
      
      Metric Increase:
          T10370
  20. Jul 28, 2020
    • This patch addresses the exponential blow-up in the simplifier. · 0bd60059
      Simon Peyton Jones authored and Marge Bot committed
      Specifically:
        #13253 exponential inlining
        #10421 ditto
        #18140 strict constructors
        #18282 another nested-function call case
      
      This patch makes one really significant change: change the way that
      mkDupableCont handles StrictArg.  The details are explained in
      GHC.Core.Opt.Simplify Note [Duplicating StrictArg].
      
      Specific changes
      
      * In mkDupableCont, when making auxiliary bindings for the other arguments
        of a call, add extra plumbing so that we don't forget the demand on them.
        Otherwise we have to wait for another round of strictness analysis. But
        actually all the info is to hand.  This change affects:
        - Make the strictness list in ArgInfo be [Demand] instead of [Bool],
          and rename it to ai_dmds.
        - Add as_dmd to ValArg
        - Simplify.makeTrivial takes a Demand
        - mkDupableContWithDmds takes a [Demand]
      
      There are a number of other small changes
      
      1. For Ids that are used at most once in each branch of a case, make
         the occurrence analyser record the total number of syntactic
         occurrences.  Previously we recorded just OneBranch or
         MultipleBranches.
      
         I thought this was going to be useful, but I ended up barely
         using it; see Note [Suppress exponential blowup] in
         GHC.Core.Opt.Simplify.Utils
      
         Actual changes:
           * See the occ_n_br field of OneOcc.
           * postInlineUnconditionally
      
      2. I found a small perf buglet in SetLevels; see the new
         function GHC.Core.Opt.SetLevels.hasFreeJoin
      
      3. Remove the sc_cci field of StrictArg.  I found I could get
         its information from the sc_fun field instead.  Less to get
         wrong!
      
      4. In ArgInfo, arrange that ai_dmds and ai_discs have a simpler
         invariant: they line up with the value arguments beyond ai_args
         This allowed a bit of nice refactoring; see isStrictArgInfo,
         lazyArgContext, strictArgContext
      
      There is virtually no difference in nofib. (The runtime numbers
      are bogus -- I tried a few manually.)
      
              Program           Size    Allocs   Runtime   Elapsed  TotalMem
      --------------------------------------------------------------------------------
                  fft          +0.0%     -2.0%    -48.3%    -49.4%      0.0%
           multiplier          +0.0%     -2.2%    -50.3%    -50.9%      0.0%
      --------------------------------------------------------------------------------
                  Min          -0.4%     -2.2%    -59.2%    -60.4%      0.0%
                  Max          +0.0%     +0.1%     +3.3%     +4.9%      0.0%
       Geometric Mean          +0.0%     -0.0%    -33.2%    -34.3%     -0.0%
      
      Test T18282 is an existing example of these deeply-nested strict calls.
      We get a big decrease in compile time (-85%) because so much less
      inlining takes place.
      
      Metric Decrease:
          T18282
  21. Jul 23, 2020
  22. Jul 13, 2020
    • Reduce result discount in conSize · 7ccb760b
      Simon Peyton Jones authored
      Ticket #18282 showed that the result discount given by conSize
      was massively too large.  This patch reduces that discount to
      a constant 10, which just balances the cost of the constructor
      application itself.
      
      Note [Constructor size and result discount] elaborates, as
      does the ticket #18282.
      
      Reducing result discount reduces inlining, which affects perf.  I
      found that I could increase the unfoldingUseThreshold from 80 to 90 in
      compensation; in combination with the result discount change I get
      these overall nofib numbers:
      
              Program           Size    Allocs   Runtime   Elapsed  TotalMem
      --------------------------------------------------------------------------------
                boyer          -0.2%     +5.4%     -3.2%     -3.4%      0.0%
             cichelli          -0.1%     +5.9%    -11.2%    -11.7%      0.0%
            compress2          -0.2%     +9.6%     -6.0%     -6.8%      0.0%
         cryptarithm2          -0.1%     -3.9%     -6.0%     -5.7%      0.0%
               gamteb          -0.2%     +2.6%    -13.8%    -14.4%      0.0%
               genfft          -0.1%     -1.6%    -29.5%    -29.9%      0.0%
                   gg          -0.0%     -2.2%    -17.2%    -17.8%    -20.0%
                 life          -0.1%     -2.2%    -62.3%    -63.4%      0.0%
                 mate          +0.0%     +1.4%     -5.1%     -5.1%    -14.3%
               parser          -0.2%     -2.1%     +7.4%     +6.7%      0.0%
            primetest          -0.2%    -12.8%    -14.3%    -14.2%      0.0%
               puzzle          -0.2%     +2.1%    -10.0%    -10.4%      0.0%
                  rsa          -0.2%    -11.7%     -3.7%     -3.8%      0.0%
               simple          -0.2%     +2.8%    -36.7%    -38.3%     -2.2%
         wheel-sieve2          -0.1%    -19.2%    -48.8%    -49.2%    -42.9%
      --------------------------------------------------------------------------------
                  Min          -0.4%    -19.2%    -62.3%    -63.4%    -42.9%
                  Max          +0.3%     +9.6%     +7.4%    +11.0%    +16.7%
       Geometric Mean          -0.1%     -0.3%    -17.6%    -18.0%     -0.7%
      
      I'm ok with these numbers, remembering that this change removes
      an *exponential* increase in code size in some in-the-wild cases.
      
      I investigated compress2.  The difference is entirely caused by this
      function no longer inlining
      
      WriteRoutines.$woutputCodes
        = \ (w :: [CodeEvent]) ->
            let result_s1Sr
                  = case WriteRoutines.outputCodes_$s$woutput w 0# 0# 8# 9# of
                      (# ww1, ww2 #) -> (ww1, ww2)
            in (# case result_s1Sr of (x, _) ->
                    map @Int @Char WriteRoutines.outputCodes1 x
               , case result_s1Sr of { (_, y) -> y } #)
      
      It was right on the cusp before, driven by the excessive result
      discount.  Too bad!
      
      Happily, the compiler/perf tests show a number of improvements:
          T12227     compiler bytes-alloc  -6.6%
          T12545     compiler bytes-alloc  -4.7%
          T13056     compiler bytes-alloc  -3.3%
          T15263     runtime  bytes-alloc -13.1%
          T17499     runtime  bytes-alloc -14.3%
          T3294      compiler bytes-alloc  -1.1%
          T5030      compiler bytes-alloc -11.7%
          T9872a     compiler bytes-alloc  -2.0%
          T9872b     compiler bytes-alloc  -1.2%
          T9872c     compiler bytes-alloc  -1.5%
      
      Metric Decrease:
          T12227
          T12545
          T13056
          T15263
          T17499
          T3294
          T5030
          T9872a
          T9872b
          T9872c
      7ccb760b
  23. Jun 10, 2020
    • Simon Peyton Jones's avatar
      Implement cast worker/wrapper properly · 6d49d5be
      Simon Peyton Jones authored and Marge Bot committed
      The cast worker/wrapper transformation transforms
         x = e |> co
      into
         y = e
         x = y |> co
      
      This is done by the simplifier, but we were being
      careless about transferring IdInfo from x to y,
      and about what to do if x is a NOINLINE function.
      This resulted in a series of bugs:
           #17673, #18093, #18078.
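As a source-level illustration of the transformation above (a hand-written sketch only: the simplifier works on Core, and `Age`, `xBefore`, `y`, `xAfter` and `unAge` are hypothetical names), `coerce` through a newtype plays the role of the cast `|> co`:

```haskell
import Data.Coerce (coerce)

-- A newtype gives us a zero-cost coercion between Int and Age.
newtype Age = Age Int

-- Before: the right-hand side is a cast of a non-trivial expression,
-- roughly `x = e |> co`.
xBefore :: Age
xBefore = coerce (sum [1 .. 10 :: Int])

-- After: the real work lives in `y` ...
y :: Int
y = sum [1 .. 10]

-- ... and `x` is a cheap wrapper that only casts, roughly `x = y |> co`.
xAfter :: Age
xAfter = coerce y

-- Helper to observe the results.
unAge :: Age -> Int
unAge (Age n) = n
```

Both `unAge xBefore` and `unAge xAfter` evaluate to the same number; the point of the split is that `xAfter` is trivial to inline while `y` carries the IdInfo of the original computation.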
      
      This patch fixes all that:
      
      * Main change is in GHC.Core.Opt.Simplify, and
        the new prepareBinding function, which does this
        cast worker/wrapper transform.
        See Note [Cast worker/wrappers].
      
      * There is quite a bit of refactoring around
        prepareRhs, makeTrivial etc.  It's nicer now.
      
      * Some wrappers from strictness and cast w/w, notably those for
        a function with a NOINLINE, should inline very late. There
        wasn't really a mechanism for that, which was an existing bug
        really; so I invented a new finalPhase = Phase (-1).  It's used
        for all simplifier runs after the user-visible phases 2, 1, 0 have
        run.  (No new runs of the simplifier are introduced thereby.)
      
        See new Note [Compiler phases] in GHC.Types.Basic;
        the main changes are in GHC.Core.Opt.Driver
      
      * Doing this made me trip over two places where the AnonArgFlag on a
        FunTy was being lost so we could end up with (Num a -> ty)
        rather than (Num a => ty)
          - In coercionLKind/coercionRKind
          - In contHoleType in the Simplifier
      
        I fixed the former by defining mkFunctionType and using it in
        coercionLKind/RKind.
      
        I could have done the same for the latter, but the information
        is almost to hand.  So I fixed the latter by
          - adding sc_hole_ty to ApplyToVal (like ApplyToTy),
          - adding as_hole_ty to ValArg (like TyArg)
          - adding sc_fun_ty to StrictArg
        Turned out I could then remove ai_type from ArgInfo.  This is
        just moving the deck chairs around, but it worked out nicely.
      
        See the new Note [AnonArgFlag] in GHC.Types.Var
      
      * When looking at the 'arity decrease' thing (#18093) I discovered
        that stable unfoldings had a much lower arity than the actual
        optimised function.  That's what led to the arity-decrease
        message.  Simple solution: eta-expand.
      
        It's described in Note [Eta-expand stable unfoldings]
        in GHC.Core.Opt.Simplify
      
      * I also discovered that unsafeCoerce wasn't being inlined if
        the context was boring.  So (\x. f (unsafeCoerce x)) would
        create a thunk -- yikes!  I fixed that by making inlineBoringOK
        a bit cleverer: see Note [Inline unsafeCoerce] in GHC.Core.Unfold.
      
        I also found that unsafeCoerceName was unused, so I removed it.
      
      I made a test case for #18078, and a very similar one for #17673.
      
      The net effect of all this on nofib is very modest, but positive:
      
      --------------------------------------------------------------------------------
              Program           Size    Allocs   Runtime   Elapsed  TotalMem
      --------------------------------------------------------------------------------
                 anna          -0.4%     -0.1%     -3.1%     -3.1%      0.0%
       fannkuch-redux          -0.4%     -0.3%     -0.1%     -0.1%      0.0%
             maillist          -0.4%     -0.1%     -7.8%     -1.0%    -14.3%
            primetest          -0.4%    -15.6%     -7.1%     -6.6%      0.0%
      --------------------------------------------------------------------------------
                  Min          -0.9%    -15.6%    -13.3%    -14.2%    -14.3%
                  Max          -0.3%      0.0%    +12.1%    +12.4%      0.0%
       Geometric Mean          -0.4%     -0.2%     -2.3%     -2.2%     -0.1%
      
      All following metric decreases are compile-time allocation decreases
      between -1% and -3%:
      
      Metric Decrease:
        T5631
        T13701
        T14697
        T15164
      6d49d5be
  24. May 13, 2020
    • Sebastian Graf's avatar
      CprAnal: Don't attach CPR sigs to expandable bindings (#18154) · 86d8ac22
      Sebastian Graf authored and Marge Bot committed
      Instead, look through expandable unfoldings in `cprTransform`.
      See the new Note [CPR for expandable unfoldings]:
      
      ```
      Long static data structures (whether top-level or not) like
      
        xs = x1 : xs1
        xs1 = x2 : xs2
        xs2 = x3 : xs3
      
      should not get CPR signatures, because they
      
        * Never get WW'd, so their CPR signature should be irrelevant after analysis
          (in fact the signature might even be harmful for that reason)
        * Would need to be inlined/expanded to see their constructed product
        * Recording CPR on them blows up interface file sizes and is redundant with
          their unfolding. In case of Nested CPR, this blow-up can be quadratic!
      
      But we can't just stop giving DataCon application bindings the CPR property,
      for example
      
        fac 0 = 1
        fac n = n * fac (n-1)
      
      fac certainly has the CPR property and should be WW'd! But FloatOut will
      transform the first clause to
      
        lvl = 1
        fac 0 = lvl
      
      If lvl doesn't have the CPR property, fac won't either. But lvl doesn't have a
      CPR signature to extrapolate into a CPR transformer ('cprTransform'). So
      instead we keep on cprAnal'ing through *expandable* unfoldings for these arity
      0 bindings via 'cprExpandUnfolding_maybe'.
      
      In practice, GHC generates a lot of (nested) TyCon and KindRep bindings, one
      for each data declaration. It's wasteful to attach CPR signatures to each of
      them (and intractable in case of Nested CPR).
      ```
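To make "fac has the CPR property and should be WW'd" concrete, here is a hand-written approximation of the split, assuming the MagicHash primitives exported by GHC.Exts. The names `facWorker` and `facWrapper` are hypothetical; GHC's real output is Core, and it combines CPR with the strictness-based unboxing of the argument:

```haskell
{-# LANGUAGE MagicHash #-}
import GHC.Exts (Int (I#), Int#, (*#), (-#))

-- The original definition from the Note.
fac :: Int -> Int
fac 0 = 1
fac n = n * fac (n - 1)

-- Worker: returns the unboxed payload of the result, so no I# box
-- is allocated at each recursive step.
facWorker :: Int# -> Int#
facWorker 0# = 1#
facWorker n  = n *# facWorker (n -# 1#)

-- Wrapper: unboxes the argument, calls the worker, reboxes once.
facWrapper :: Int -> Int
facWrapper (I# n) = I# (facWorker n)
```

`facWrapper 5` and `fac 5` compute the same value; the worker only exists because the analysis can see that every branch of `fac`, including the floated-out `lvl`, returns a constructed `Int`.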
      
      Fixes #18154.
      86d8ac22
  25. Feb 12, 2020
  26. Jan 31, 2020
    • Ömer Sinan Ağacan's avatar
      Do CafInfo/SRT analysis in Cmm · c846618a
      Ömer Sinan Ağacan authored
      This patch removes all CafInfo predictions and various hacks to preserve
      predicted CafInfos from the compiler and assigns final CafInfos to
      interface Ids after code generation. SRT analysis is extended to support
      static data, and Cmm generator is modified to allow generating
      static_link fields after SRT analysis.
      
      This also fixes `-fcatch-bottoms`, which introduces error calls in case
      expressions in CorePrep, which runs *after* CoreTidy (which is where we
      decide on CafInfos) and turns previously non-CAFFY things into CAFFY.
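For readers unfamiliar with the terminology, a small illustration with hypothetical names (not taken from the patch). A CAF is a top-level binding that takes no arguments and needs evaluation; a closure is CAFFY when it may refer to a CAF, directly or transitively, which is what SRTs record for the garbage collector. Whether a given binding really ends up as a CAF depends on optimisation, so read the comments as the typical case:

```haskell
-- A CAF: a top-level thunk, evaluated at most once and then updated.
table :: [Int]
table = map (* 2) [0 .. 1000]

-- CAFFY: the function refers to the CAF `table`, so `table` has to be
-- reachable from it as far as the garbage collector is concerned.
lookupDouble :: Int -> Int
lookupDouble i = table !! i

-- Typically not CAFFY: it refers to no CAF at all.
plusOne :: Int -> Int
plusOne n = n + 1
```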
      
      Fixes #17648
      Fixes #9718
      
      Evaluation
      ==========
      
      NoFib
      -----
      
      Boot with: `make boot mode=fast`
      Run: `make mode=fast EXTRA_RUNTEST_OPTS="-cachegrind" NoFibRuns=1`
      
      --------------------------------------------------------------------------------
              Program           Size    Allocs    Instrs     Reads    Writes
      --------------------------------------------------------------------------------
                   CS          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                  CSD          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                   FS          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                    S          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                   VS          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                  VSD          -0.0%      0.0%     -0.0%     -0.0%     -0.5%
                  VSM          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                 anna          -0.1%      0.0%     -0.0%     -0.0%     -0.0%
                 ansi          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                 atom          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
               awards          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
               banner          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           bernouilli          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         binary-trees          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                boyer          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
               boyer2          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                 bspt          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
            cacheprof          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
             calendar          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
             cichelli          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
              circsim          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
             clausify          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
        comp_lab_zift          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
             compress          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
            compress2          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
          constraints          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         cryptarithm1          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         cryptarithm2          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                  cse          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         digits-of-e1          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         digits-of-e2          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
               dom-lt          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                eliza          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                event          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
          exact-reals          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
               exp3_8          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
               expert          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
       fannkuch-redux          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                fasta          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                  fem          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                  fft          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                 fft2          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
             fibheaps          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                 fish          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                fluid          -0.1%      0.0%     -0.0%     -0.0%     -0.0%
               fulsom          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
               gamteb          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                  gcd          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
          gen_regexps          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
               genfft          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                   gg          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                 grep          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
               hidden          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                  hpg          -0.1%      0.0%     -0.0%     -0.0%     -0.0%
                  ida          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                infer          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
              integer          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
            integrate          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         k-nucleotide          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                kahan          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
              knights          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
               lambda          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           last-piece          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                 lcss          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                 life          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                 lift          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
               linear          -0.1%      0.0%     -0.0%     -0.0%     -0.0%
            listcompr          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
             listcopy          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
             maillist          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
               mandel          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
              mandel2          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                 mate          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
              minimax          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
              mkhprog          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           multiplier          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
               n-body          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
             nucleic2          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                 para          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
            paraffins          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
               parser          -0.1%      0.0%     -0.0%     -0.0%     -0.0%
              parstof          -0.1%      0.0%     -0.0%     -0.0%     -0.0%
                  pic          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
             pidigits          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                power          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
               pretty          -0.0%      0.0%     -0.3%     -0.4%     -0.4%
               primes          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
            primetest          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
               prolog          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
               puzzle          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
               queens          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
              reptile          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
      reverse-complem          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
              rewrite          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                 rfib          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                  rsa          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                  scc          -0.0%      0.0%     -0.3%     -0.5%     -0.4%
                sched          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                  scs          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
               simple          -0.1%      0.0%     -0.0%     -0.0%     -0.0%
                solid          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
              sorting          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
        spectral-norm          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
               sphere          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
               symalg          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                  tak          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
            transform          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
             treejoin          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
            typecheck          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
              veritas          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                 wang          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
            wave4main          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         wheel-sieve1          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         wheel-sieve2          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                 x2n1          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
      --------------------------------------------------------------------------------
                  Min          -0.1%      0.0%     -0.3%     -0.5%     -0.5%
                  Max          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
       Geometric Mean          -0.0%     -0.0%     -0.0%     -0.0%     -0.0%
      
      --------------------------------------------------------------------------------
              Program           Size    Allocs    Instrs     Reads    Writes
      --------------------------------------------------------------------------------
              circsim          -0.1%      0.0%     -0.0%     -0.0%     -0.0%
          constraints          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
             fibheaps          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
             gc_bench          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                 hash          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                 lcss          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
                power          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           spellcheck          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
      --------------------------------------------------------------------------------
                  Min          -0.1%      0.0%     -0.0%     -0.0%     -0.0%
                  Max          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
       Geometric Mean          -0.0%     +0.0%     -0.0%     -0.0%     -0.0%
      
      Manual inspection of programs in testsuite/tests/programs
      ---------------------------------------------------------
      
      I built these programs with a bunch of dump flags and `-O` and compared
      STG, Cmm, and Asm dumps and file sizes.
      
      (Below, the numbers in parentheses show the number of modules in each
      program.)

      These programs have identical compiler outputs (same .hi and .o sizes,
      STG, and Cmm and Asm dumps):
      
      - Queens (1), andre_monad (1), cholewo-eval (2), cvh_unboxing (3),
        andy_cherry (7), fun_insts (1), hs-boot (4), fast2haskell (2),
        jl_defaults (1), jq_readsPrec (1), jules_xref (1), jtod_circint (4),
        jules_xref2 (1), lennart_range (1), lex (1), life_space_leak (1),
        bargon-mangler-bug (7), record_upd (1), rittri (1), sanders_array (1),
        strict_anns (1), thurston-module-arith (2), okeefe_neural (1),
        joao-circular (6), 10queens (1)
      
      Programs with different compiler outputs:
      
      - jl_defaults (1): For some reason GHC HEAD marks a lot of top-level
        `[Int]` closures as CAFFY for no reason. With this patch we no longer
        make them CAFFY and generate fewer SRT entries. For some reason Main.o
        is slightly larger with this patch (1.3%) and the executable sizes are
        the same. (I'd expect both to be smaller)
      
      - launchbury (1): Same as jl_defaults: top-level `[Int]` closures marked
        as CAFFY for no reason. Similarly `Main.o` is 1.4% larger but the
        executable sizes are the same.
      
      - galois_raytrace (13): Differences are in the Parse module. There are a
        lot, but some of the changes are caused by the fact that for some
        reason (I think a bug) GHC HEAD marks the dictionary for `Functor
        Identity` as CAFFY. Parse.o is 0.4% larger, the executable size is the
        same.
      
      - north_array: We now generate fewer SRT entries because some of the array
        primops used in this program like `NewArrayOp` get eliminated during
        Stg-to-Cmm and turn some CAFFY things into non-CAFFY. Main.o gets 2.4%
        larger (9224 bytes from 9000 bytes), executable sizes are the same.
      
      - seward-space-leak: Difference in this program is better shown by this
        smaller example:
      
            module Lib where
      
            data CDS
              = Case [CDS] [(Int, CDS)]
              | Call CDS CDS
      
            instance Eq CDS where
              Case sels1 rets1 == Case sels2 rets2 =
                  sels1 == sels2 && rets1 == rets2
              Call a1 b1 == Call a2 b2 =
                  a1 == a2 && b1 == b2
              _ == _ =
                  False
      
         In this program GHC HEAD builds a new SRT for the recursive group of
         `(==)`, `(/=)` and the dictionary closure. Then `/=` points to `==`
         in its SRT field, and `==` uses the SRT object as its SRT. With this
         patch we use the closure for `/=` as the SRT and add `==` there. Then
         `/=` gets an empty SRT field and `==` points to `/=` in its SRT
         field.
      
         This change looks fine to me.
      
         Main.o gets 0.07% larger, executable sizes are identical.
      
      head.hackage
      ------------
      
      head.hackage's CI script builds 428 packages from Hackage using this
      patch with no failures.
      
      Compiler performance
      --------------------
      
      The compiler perf tests report that the compiler allocates slightly more
      (worst case observed so far is 4%). However most programs in the test
      suite are small, single file programs. To benchmark compiler performance
      on something more realistic I built Cabal (the library, 236 modules)
      with different optimisation levels. For the "max residency" row I ran
      GHC with `+RTS -s -A100k -i0 -h` for more accurate numbers. Other rows
      were generated with just `-s`. (This is because `-i0` makes the GC run
      much more frequently, and as a result "bytes copied" gets inflated by
      more than 25x in some cases.)
      
      * -O0
      
      |                 | GHC HEAD       | This MR        | Diff   |
      | --------------- | -------------- | -------------- | ------ |
      | Bytes allocated | 54,413,350,872 | 54,701,099,464 | +0.52% |
      | Bytes copied    |  4,926,037,184 |  4,990,638,760 | +1.31% |
      | Max residency   |    421,225,624 |    424,324,264 | +0.73% |
      
      * -O1
      
      |                 | GHC HEAD        | This MR         | Diff   |
      | --------------- | --------------- | --------------- | ------ |
      | Bytes allocated | 245,849,209,992 | 246,562,088,672 | +0.28% |
      | Bytes copied    |  26,943,452,560 |  27,089,972,296 | +0.54% |
      | Max residency   |     982,643,440 |     991,663,432 | +0.91% |
      
      * -O2
      
      |                 | GHC HEAD        | This MR         | Diff   |
      | --------------- | --------------- | --------------- | ------ |
      | Bytes allocated | 291,044,511,408 | 291,863,910,912 | +0.28% |
      | Bytes copied    |  37,044,237,616 |  36,121,690,472 | -2.49% |
      | Max residency   |   1,071,600,328 |   1,086,396,256 | +1.38% |
      
      Extra compiler allocations
      --------------------------
      
      Runtime allocations of programs are as reported above (NoFib section).
      
      The compiler now allocates more than before. The main source of extra
      allocation in this patch compared to the base commit is the new SRT
      algorithm (GHC.Cmm.Info.Build). Below is some of the extra work we do with
      this patch; the numbers were generated by a profiled stage 2 compiler when
      building a pathological case (the test 'ManyConstructors') with '-O2':
      
      - We now sort the final STG for a module, which means traversing the
        entire program, generating the free variable set for each top-level
        binding, doing SCC analysis, and re-ordering the program. In
        ManyConstructors this step allocates 97,889,952 bytes.
      
      - We now do SRT analysis on static data, which in a program like
        ManyConstructors means analysing 10,000 bindings that we would
        previously just skip. This step allocates 70,898,352 bytes.
      
      - We now maintain an SRT map for the entire module as we compile Cmm
        groups:
      
            data ModuleSRTInfo = ModuleSRTInfo
              { ...
              , moduleSRTMap :: SRTMap
              }
      
         (SRTMap is just a strict Map from the 'containers' library)
      
         This map gets an entry for most bindings in a module (exceptions are
         THUNKs and CAFFY static functions). For ManyConstructors this map
         gets 50015 entries.
      
      - Once we're done with code generation we generate a NameSet from SRTMap
        for the non-CAFFY names in the current module. This set gets the same
        number of entries as the SRTMap.
      
      - Finally we update CafInfos in ModDetails for the non-CAFFY Ids, using
        the NameSet generated in the previous step. This usually does the
        least amount of allocation among the work listed here.
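The first step in the list above (free variables, SCC analysis, re-ordering) can be sketched as follows. This is an assumption-laden illustration, not the code in GHC: it uses `Data.Graph` from the 'containers' library as a stand-in, reduces each binding to a name plus the top-level names free in its right-hand side, and `Binding`, `sortBindings` and `example` are hypothetical names:

```haskell
import Data.Graph (flattenSCC, stronglyConnComp)

-- A top-level binding, reduced to its name and the top-level names
-- that occur free in its right-hand side.
type Binding = (String, [String])

-- Group bindings into strongly connected components. stronglyConnComp
-- returns components in reverse topological order, so each component
-- comes after the components it depends on.
sortBindings :: [Binding] -> [[String]]
sortBindings binds =
  map flattenSCC
      (stronglyConnComp [ (name, name, fvs) | (name, fvs) <- binds ])

-- `f` and `g` are mutually recursive, `main` uses `f`, `h` is independent.
example :: [[String]]
example = sortBindings
  [ ("main", ["f"])
  , ("f",    ["g"])
  , ("g",    ["f"])
  , ("h",    [])
  ]
```

In `example`, `f` and `g` land in one component, and that component appears before the one containing `main`. Building the free variable sets and the graph for every top-level binding is where the allocation quoted above comes from.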
      
      The only place where we do less work with this patch is the CAF analysis in
      the tidying pass (CoreTidy). However, that doesn't save us much, as the
      pass still needs to traverse the whole program and update IdInfos for
      other reasons. The only thing we no longer do there is the `hasCafRefs`
      pass over the RHS of bindings, which is a stateless pass that returns a
      boolean value, so it doesn't allocate much.
      
      (Metric changes below are all increased allocations)
      
      Metric changes
      --------------
      
      Metric Increase:
          ManyAlternatives
          ManyConstructors
          T13035
          T14683
          T1969
          T9961
      c846618a
  27. Jan 08, 2020
    • Ryan Scott's avatar
      Print Core type applications with no whitespace after @ (#17643) · 923a1272
      Ryan Scott authored and Marge Bot committed
      This brings the pretty-printer for Core in line with how visible
      type applications are normally printed: namely, with no whitespace
      after the `@` character (i.e., `f @a` instead of `f @ a`). While I'm
      in town, I also give the same treatment to type abstractions (i.e.,
      `\(@a)` instead of `\(@ a)`) and coercion applications (i.e.,
      `f @~x` instead of `f @~ x`).
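For comparison, this is the source-level convention that the Core printer now matches. The snippet is a minimal illustration with the hypothetical name `answer`; it shows ordinary Haskell, not Core output:

```haskell
{-# LANGUAGE TypeApplications #-}

-- Visible type application in source Haskell: no space after `@`.
answer :: Int
answer = read @Int "42"
```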
      
      Fixes #17643.
      923a1272
  28. Dec 12, 2018
    • Simon Peyton Jones's avatar
      Improvements to demand analysis · d77501cd
      Simon Peyton Jones authored
      This patch collects a few improvements triggered by Trac #15696,
      and fixes Trac #16029.
      
      * Stop making toCleanDmd behave specially for unlifted types.
        This special case was the cause of stupid behaviour in Trac
        #16029.  And to my joy I discovered the let/app invariant
        rendered it unnecessary.  (Maybe the special case pre-dated
        the let/app invariant.)
      
        Result: less special-case handling in the compiler, and
        better perf for the compiled code.
      
      * In WwLib.mkWWstr_one, treat seqDmd like U(AAA).  It was not
        being so treated before, which again led to stupid code.
      
      * Update and improve Notes
      
      There are .stderr test wibbles because we get slightly different
      strictness signatures for an argument of unlifted type:
          <L,U> rather than <S,U>        for Int#
          <S,U> rather than <S(S),U(U)>  for Int
      d77501cd
  29. Jun 07, 2018
    • Simon Peyton Jones's avatar
      Remove ad-hoc special case in occAnal · c16382d5
      Simon Peyton Jones authored
      Back in 1999 I put this ad-hoc code in the Case-handling
      code for occAnal:
      
        occAnal env (Case scrut bndr ty alts)
         = ...
              -- Note [Case binder usage]
              -- ~~~~~~~~~~~~~~~~~~~~~~~~
              -- The case binder gets a usage of either "many" or "dead", never "one".
              -- Reason: we like to inline single occurrences, to eliminate a binding,
              -- but inlining a case binder *doesn't* eliminate a binding.
              -- We *don't* want to transform
              --      case x of w { (p,q) -> f w }
              -- into
              --      case x of w { (p,q) -> f (p,q) }
          tag_case_bndr usage bndr
            = (usage', setIdOccInfo bndr final_occ_info)
            where
              occ_info       = lookupDetails usage bndr
              usage'         = usage `delDetails` bndr
              final_occ_info = case occ_info of IAmDead -> IAmDead
                                                _       -> noOccInfo
      
      But the comment looks wrong -- the bad inlining will not happen -- and
      I think it relates to some long-ago version of the simplifier.
      
      So I simply removed the special case, which gives more accurate
      occurrence-info to the case binder.  Interestingly I got a slight
      improvement in nofib binary sizes.
      
      --------------------------------------------------------------------------------
              Program           Size    Allocs   Runtime   Elapsed  TotalMem
      --------------------------------------------------------------------------------
            cacheprof          -0.1%     +0.2%     -0.7%     -1.2%     +8.6%
      --------------------------------------------------------------------------------
                  Min          -0.2%      0.0%    -14.5%    -30.5%      0.0%
                  Max          -0.1%     +0.2%    +10.0%    +10.0%    +25.0%
       Geometric Mean          -0.2%     +0.0%     -1.9%     -5.4%     +0.3%
      
      I have no idea if the improvement in runtime is real.  I did look at the
      tiny increase in allocation for cacheprof and concluded that it was
      unimportant (I forget the details).
      
      Also the more accurate occ-info for the case binder meant that some
      inlining happens in one pass that previously took successive passes
      for the test dependent/should_compile/dynamic-paper (which has a
      known Russell-paradox infinite loop in the simplifier).
      
      In short, a small win: less ad-hoc complexity and slightly smaller
      binaries.
      c16382d5
  30. Apr 20, 2018
    • Simon Peyton Jones's avatar
      Inline wrappers earlier · 8b10b896
      Simon Peyton Jones authored
      This patch has a single significant change:
      
        strictness wrapper functions are inlined earlier,
        in phase 2 rather than phase 0.
      
      As shown by Trac #15056, this gives a better chance for RULEs to fire.
      Before this change, a function that would have inlined early without
      strictness analysis was instead inlining late. Result: applying
      "optimisation" made the program worse.
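The phase numbers can be made concrete with a hand-written wrapper/worker pair. This is a sketch under stated assumptions: `double` and `doubleWorker` are hypothetical, and GHC generates strictness wrappers itself rather than reading pragmas like these. The simplifier counts phases down 2, 1, 0, and `INLINE [k]` means "do not inline before phase k", so `[2]` corresponds to the new, earlier behaviour and `[0]` to the old one:

```haskell
{-# LANGUAGE MagicHash #-}
import GHC.Exts (Int (I#), Int#, (*#))

-- Worker: operates directly on the unboxed argument.
doubleWorker :: Int# -> Int#
doubleWorker n = n *# 2#
{-# NOINLINE doubleWorker #-}

-- Wrapper: unboxes the argument, calls the worker, reboxes the result.
-- INLINE [2] lets it inline from phase 2 onwards, before the later
-- phases in which most RULES get their chance to fire.
double :: Int -> Int
double (I# n) = I# (doubleWorker n)
{-# INLINE [2] double #-}
```

With `[0]` in place of `[2]`, the wrapper would stay opaque until the last user-visible phase, which is the situation the commit describes as making the program worse.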
      
      This does not make too much difference in nofib, but I've stumbled
      over the problem more than once, so even a "no-change" result would be
      quite acceptable.  Here are the headlines:
      
      --------------------------------------------------------------------------------
              Program           Size    Allocs   Runtime   Elapsed  TotalMem
      --------------------------------------------------------------------------------
            cacheprof          -0.5%     -0.5%     +2.5%     +2.5%      0.0%
               fulsom          -1.0%     +2.6%     -0.1%     -0.1%      0.0%
                 mate          -0.6%     +2.4%     -0.9%     -0.9%      0.0%
              veritas          -0.7%    -23.2%     0.002     0.002      0.0%
      --------------------------------------------------------------------------------
                  Min          -1.4%    -23.2%    -12.5%    -15.3%      0.0%
                  Max          +0.6%     +2.6%     +4.4%     +4.3%    +19.0%
       Geometric Mean          -0.7%     -0.2%     -1.4%     -1.7%     +0.2%
      
      * A worthwhile reduction in binary size.
      
      * Runtimes are not to be trusted much but look as if they
        are moving the right way.
      
      * A really big win in veritas, described in comment:1 of
        Trac #15056; more fusion rules fired.
      
      * I investigated the losses in 'mate' and 'fulsom'; see #15056.
      8b10b896
  31. Jan 03, 2018
    • Simon Peyton Jones's avatar
      Get evaluated-ness right in the back end · bd438b2d
      Simon Peyton Jones authored
      See Trac #14626, comment:4.  We want to maintain evaluated-ness
      info on Ids into the code generator for two reasons
      (see Note [Preserve evaluated-ness in CorePrep] in CorePrep):
      
      - DataToTag magic
      - Potentially using it in the codegen (this is Gabor's
        current work)
      
      But it was all being done very inconsistently, and actually
      outright wrong -- the DataToTag magic hasn't been working for
      years.
      
      This patch tidies it all up, with Notes to match.
      bd438b2d
  32. Sep 12, 2017
  33. Mar 06, 2017
    • Simon Peyton Jones's avatar
      Make FloatOut/SetLevels idempotent on bottoming functions · fb9ae288
      Simon Peyton Jones authored
      This fixes Trac #13369.  It turned out that I really had got the
      bottoming-float code wrong, again.  The new story is explained in
      Note [Bottoming floats], esp item (3), and Note [Floating from a RHS].
      
      I didn't make a regression test; it's hard to do so.
      
      Nofib results are good:
      
      --------------------------------------------------------------------------------
              Program           Size    Allocs   Runtime   Elapsed  TotalMem
      --------------------------------------------------------------------------------
               banner          -2.2%     -4.6%      0.00      0.00     +0.0%
                 bspt          -1.3%     -1.6%      0.01      0.01     +0.0%
            cacheprof          -1.8%     -0.3%     +3.7%     +3.7%     -0.9%
         digits-of-e2          -1.0%     -1.5%     -0.5%     -0.5%     +0.0%
               expert          -1.3%     -0.2%      0.00      0.00     +0.0%
               n-body          -1.1%     -0.2%     +0.1%     +0.1%     +0.0%
              veritas          -2.9%     -0.1%      0.00      0.00     +0.0%
      --------------------------------------------------------------------------------
                  Min          -2.9%     -4.6%     -7.4%     -7.4%    -19.8%
                  Max          -1.0%     +0.0%     +5.2%     +5.1%    +10.0%
       Geometric Mean          -1.2%     -0.1%     +0.5%     +0.5%     -0.1%
      
      I /think/ all this is due to this error-floating change; but it's possible
      that some was due to commit "Fix CSE (again) on literal strings" a couple
      of commits earlier.
      fb9ae288
  34. Mar 02, 2017
  35. Feb 06, 2017
    • Eric Seidel's avatar
      Do Worker/Wrapper for NOINLINE things · b572aadb
      Eric Seidel authored and Ben Gamari committed
      Disabling worker/wrapper for NOINLINE things can cause unnecessary
      reboxing of values. Consider
      
          {-# NOINLINE f #-}
          f :: Int -> a
          f x = error (show x)
      
          g :: Bool -> Bool -> Int -> Int
          g True  True  p = f p
          g False True  p = p + 1
          g b     False p = g b True p
      
      the strictness analysis will discover f and g are strict, but because f
      has no wrapper, the worker for g will rebox p. So we get
      
          $wg x y p# =
            let p = I# p# in  -- Yikes! Reboxing!
            case x of
              False ->
                case y of
                  False -> $wg False True p#
                  True -> +# p# 1#
              True ->
                case y of
                  False -> $wg True True p#
                  True -> case f p of { }
      
          g x y p = case p of (I# p#) -> $wg x y p#
      
      Now, in this case the reboxing will float into the True branch, and so
      the allocation will only happen on the error path. But it won't float
      inwards if there are multiple branches that call (f p), so the reboxing
      will happen on every call of g. Disaster.
      
      Solution: do worker/wrapper even on NOINLINE things; but move the
      NOINLINE pragma to the worker.
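
      A hand-written sketch of the intended result, under some assumptions:
      `wf` and `wg` are hypothetical names for the workers GHC would call
      $wf and $wg, and `wg` returns a boxed Int to keep the types simple.
      The NOINLINE pragma sits on `wf`, so the body of f is still never
      inlined, while the wrapper `f` can inline and let `wg` pass the
      unboxed p# straight through without reboxing.

```haskell
{-# LANGUAGE MagicHash #-}
module Main (main) where

import GHC.Exts (Int (I#), Int#, (+#))

-- Worker for f (hypothetical name). It carries the NOINLINE pragma
-- that the user originally wrote on f.
wf :: Int# -> a
wf x# = error (show (I# x#))
{-# NOINLINE wf #-}

-- Wrapper for f: unboxes the argument and calls the worker.
-- It is free to inline at call sites.
f :: Int -> a
f (I# x#) = wf x#
{-# INLINE f #-}

-- Worker for g (hypothetical name). Because f now has a wrapper, the
-- error branch calls wf directly on p# and no I# box is allocated.
wg :: Bool -> Bool -> Int# -> Int
wg True  True  p# = wf p#
wg False True  p# = I# (p# +# 1#)
wg b     False p# = wg b True p#

-- Wrapper for g: unboxes p once, up front.
g :: Bool -> Bool -> Int -> Int
g x y (I# p#) = wg x y p#

main :: IO ()
main = do
  print (g False True 41)   -- prints 42
  print (g False False 1)   -- prints 2
```

      Only the non-erroring branches are exercised in `main`; calling
      `g True True p` reaches `wf` and raises the error, as in the original f.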
      
      Test Plan: make test TEST="13143"
      
      Reviewers: simonpj, bgamari, dfeuer, austin
      
      Reviewed By: simonpj, bgamari
      
      Subscribers: dfeuer, thomie
      
      Differential Revision: https://phabricator.haskell.org/D3046
      b572aadb