Commits · 6c613c902c02ee2372f7a65e873b1697196bcaa6 · Reinier Maas / GHC

Jan 08, 2024
- Move testsuite/tests/stranal to testsuite/tests/dmdanal · 6c613c90
  Sebastian Graf authored 1 year ago and Marge Bot committed 1 year ago
  
  A separate commit so that the rename is obvious to Git(Lab)
  6c613c90
Aug 09, 2023
- Core.Ppr: Omit case binder for empty case alternatives · 8c73505e
  Sebastian Graf authored 1 year ago and Marge Bot committed 1 year ago
  
  A minor improvement to pretty-printing
  8c73505e
Apr 26, 2023

DmdAnal: Unleash demand signatures of free RULE and unfolding binders (#23208) · c30ac25f

Sebastian Graf authored 1 year ago and

Marge Bot committed 1 year ago

In #23208 we observed that the demand signature of a binder occuring in a RULE
wasn't unleashed, leading to a transitively used binder being discarded as
absent. The solution was to use the same code path that we already use for
handling exported bindings.

See the changes to `Note [Absence analysis for stable unfoldings and RULES]`
for more details.

I took the chance to factor out the old notion of a `PlusDmdArg` (a pair of a
`VarEnv Demand` and a `Divergence`) into `DmdEnv`, which fits nicely into our
existing framework. As a result, I had to touch quite a few places in the code.

This refactoring exposed a few small bugs around correct handling of bottoming
demand environments. As a result, some strictness signatures now mention uniques
that weren't there before which caused test output changes to T13143, T19969 and
T22112. But these tests compared whole -ddump-simpl listings which is a very
fragile thing to begin with. I changed what exactly they test for based on the
symptoms in the corresponding issues.

There is a single regression in T18894 because we are more conservative around
stable unfoldings now. Unfortunately it is not easily fixed; let's wait until
there is a concrete motivation before invest more time.

Fixes #23208.

c30ac25f

Jan 11, 2023

Fix void-arg-adding mechanism for worker/wrapper · 964284fc

Simon Peyton Jones authored 2 years ago and

Marge Bot committed 2 years ago

As #22725 shows, in worker/wrapper we must add the void argument
/last/, not first.  See GHC.Core.Opt.WorkWrap.Utils
Note [Worker/wrapper needs to add void arg last].

That led me to to study GHC.Core.Opt.SpecConstr
Note [SpecConstr needs to add void args first] which suggests the
opposite!  And indeed I think it's the other way round for SpecConstr
-- or more precisely the void arg must precede the "extra_bndrs".

That led me to some refactoring of GHC.Core.Opt.SpecConstr.calcSpecInfo.

964284fc

Sep 28, 2022

Refactor UnfoldingSource and IfaceUnfolding · addeefc0

Simon Peyton Jones authored 2 years ago and

Marge Bot committed 2 years ago

I finally got tired of the way that IfaceUnfolding reflected
a previous structure of unfoldings, not the current one. This
MR refactors UnfoldingSource and IfaceUnfolding to be simpler
and more consistent.

It's largely just a refactor, but in UnfoldingSource (which moves
to GHC.Types.Basic, since it is now used in IfaceSyn too), I
distinguish between /user-specified/ and /system-generated/ stable
unfoldings.

    data UnfoldingSource
      = VanillaSrc
      | StableUserSrc   -- From a user-specified pragma
      | StableSystemSrc -- From a system-generated unfolding
      | CompulsorySrc

This has a minor effect in CSE (see the use of isisStableUserUnfolding
in GHC.Core.Opt.CSE), which I tripped over when working on
specialisation, but it seems like a Good Thing to know anyway.

addeefc0

Jun 27, 2022

Don't mark lambda binders as OtherCon · ac7a7fc8

Andreas Klebinger authored 2 years ago and

Marge Bot committed 2 years ago

We used to put OtherCon unfoldings on lambda binders of workers
and sometimes also join points/specializations with with the
assumption that since the wrapper would force these arguments
once we execute the RHS they would indeed be in WHNF.

This was wrong for reasons detailed in #21472. So now we purge
evaluated unfoldings from *all* lambda binders.

This fixes #21472, but at the cost of sometimes not using as efficient a
calling convention. It can also change inlining behaviour as some
occurances will no longer look like value arguments when they did
before.

As consequence we also change how we compute CBV information for
arguments slightly. We now *always* determine the CBV convention
for arguments during tidy. Earlier in the pipeline we merely mark
functions as candidates for having their arguments treated as CBV.

As before the process is described in the relevant notes:
Note [CBV Function Ids]
Note [Attaching CBV Marks to ids]
Note [Never put `OtherCon` unfoldigns on lambda binders]

-------------------------
Metric Decrease:
    T12425
    T13035
    T18223
    T18223
    T18923
    MultiLayerModulesTH_OneShot
Metric Increase:
    WWRec
-------------------------

ac7a7fc8

May 05, 2022

SpecConstr: Properly create rules for call patterns representing partial applications · 61901b32

Andreas Klebinger authored 2 years ago and

Marge Bot committed 2 years ago

The main fix is that in addVoidWorkerArg we now add the argument to the front.

This fixes #21448.

-------------------------
Metric Decrease:
    T16875
-------------------------

61901b32

Mar 16, 2022

Demand: Let `Boxed` win in `lubBoxity` (#21119) · 1575c4a5

Sebastian Graf authored 3 years ago and

Marge Bot committed 3 years ago

Previously, we let `Unboxed` win in `lubBoxity`, which is unsoundly optimistic
in terms ob Boxity analysis. "Unsoundly" in the sense that we sometimes unbox
parameters that we better shouldn't unbox. Examples are #18907 and T19871.absent.

Until now, we thought that this hack pulled its weight becuase it worked around
some shortcomings of the phase separation between Boxity analysis and CPR
analysis. But it is a gross hack which caused regressions itself that needed all
kinds of fixes and workarounds. See for example #20767. It became impossible to
work with in !7599, so I want to remove it.

For example, at the moment, `lubDmd B dmd` will not unbox `dmd`,
but `lubDmd A dmd` will. Given that `B` is supposed to be the bottom element of
the lattice, it's hardly justifiable to get a better demand when `lub`bing with
`A`.

The consequence of letting `Boxed` win in `lubBoxity` is that we *would* regress
 #2387, #16040 and parts of #5075 and T19871.sumIO, until Boxity and CPR
are able to communicate better. Fortunately, that is not the case since I could
tweak the other source of optimism in Boxity analysis that is described in
`Note [Unboxed demand on function bodies returning small products]` so that
we *recursively* assume unboxed demands on function bodies returning small
products. See the updated Note.

`Note [Boxity for bottoming functions]` describes why we need bottoming
functions to have signatures that say that they deeply unbox their arguments.
In so doing, I had to tweak `finaliseArgBoxities` so that it will never unbox
recursive data constructors. This is in line with our handling of them in CPR.
I updated `Note [Which types are unboxed?]` to reflect that.

In turn we fix #21119, #20767, #18907, T19871.absent and get a much simpler
implementation (at least to think about). We can also drop the very ad-hoc
definition of `deferAfterPreciseException` and its Note in favor of the
simple, intuitive definition we used to have.

Metric Decrease:
    T16875
    T18223
    T18698a
    T18698b
    hard_hole_fits
Metric Increase:
    LargeRecord
    MultiComponentModulesRecomp
    T15703
    T8095
    T9872d

Out of all the regresions, only the one in T9872d doesn't vanish in a perf
build, where the compiler is bootstrapped with -O2 and thus SpecConstr.
Reason for regressions:

  * T9872d is due to `ty_co_subst` taking its `LiftingContext` boxed.
    That is because the context is passed to a function argument, for
    example in `liftCoSubstTyVarBndrUsing`.
  * In T15703, LargeRecord and T8095, we get a bit more allocations in
    `expand_syn` and `piResultTys`, because a `TCvSubst` isn't unboxed.
    In both cases that guards against reboxing in some code paths.
  * The same is true for MultiComponentModulesRecomp, where we get less unboxing
    in `GHC.Unit.Finder.$wfindInstalledHomeModule`. In a perf build, allocations
    actually *improve* by over 4%!

Results on NoFib:

--------------------------------------------------------------------------------
        Program         Allocs    Instrs
--------------------------------------------------------------------------------
         awards          -0.4%     +0.3%
      cacheprof          -0.3%     +2.4%
            fft          -1.5%     -5.1%
       fibheaps          +1.2%     +0.8%
          fluid          -0.3%     -0.1%
            ida          +0.4%     +0.9%
   k-nucleotide          +0.4%     -0.1%
     last-piece         +10.5%    +13.9%
           lift          -4.4%     +3.5%
        mandel2         -99.7%    -99.8%
           mate          -0.4%     +3.6%
         parser          -1.0%     +0.1%
         puzzle         -11.6%     +6.5%
reverse-complem          -3.0%     +2.0%
            scs          -0.5%     +0.1%
         sphere          -0.4%     -0.2%
      wave4main          -8.2%     -0.3%
--------------------------------------------------------------------------------
Summary excludes mandel2 because of excessive bias
            Min         -11.6%     -5.1%
            Max         +10.5%    +13.9%
 Geometric Mean          -0.2%     +0.3%
--------------------------------------------------------------------------------

Not bad for a bug fix.

The regression in `last-piece` could become a win if SpecConstr would work on
non-recursive functions. The regression in `fibheaps` is due to
`Note [Reboxed crud for bottoming calls]`, e.g., #21128.

1575c4a5

Feb 12, 2022

Tag inference work. · 0e93023e

Andreas Klebinger authored 3 years ago and

Matthew Pickering committed 3 years ago

This does three major things:
* Enforce the invariant that all strict fields must contain tagged
pointers.
* Try to predict the tag on bindings in order to omit tag checks.
* Allows functions to pass arguments unlifted (call-by-value).

The former is "simply" achieved by wrapping any constructor allocations with
a case which will evaluate the respective strict bindings.

The prediction is done by a new data flow analysis based on the STG
representation of a program. This also helps us to avoid generating
redudant cases for the above invariant.

StrictWorkers are created by W/W directly and SpecConstr indirectly.
See the Note [Strict Worker Ids]

Other minor changes:

* Add StgUtil module containing a few functions needed by, but
  not specific to the tag analysis.

-------------------------
Metric Decrease:
	T12545
	T18698b
	T18140
	T18923
        LargeRecord
Metric Increase:
        LargeRecord
	ManyAlternatives
	ManyConstructors
	T10421
	T12425
	T12707
	T13035
	T13056
	T13253
	T13253-spj
	T13379
	T15164
	T18282
	T18304
	T18698a
	T1969
	T20049
	T3294
	T4801
	T5321FD
	T5321Fun
	T783
	T9233
	T9675
	T9961
	T19695
	WWRec
-------------------------

0e93023e

Oct 24, 2021

DmdAnal: Implement Boxity Analysis (#19871) · 3bab222c

Sebastian Graf authored 3 years ago and

Marge Bot committed 3 years ago

This patch fixes some abundant reboxing of `DynFlags` in
`GHC.HsToCore.Match.Literal.warnAboutOverflowedLit` (which was the topic
of #19407) by introducing a Boxity analysis to GHC, done as part of demand
analysis. This allows to accurately capture ad-hoc unboxing decisions previously
made in worker/wrapper in demand analysis now, where the boxity info can
propagate through demand signatures.

See the new `Note [Boxity analysis]`. The actual fix for #19407 is described in
`Note [No lazy, Unboxed demand in demand signature]`, but
`Note [Finalising boxity for demand signature]` is probably a better entry-point.

To support the fix for #19407, I had to change (what was)
`Note [Add demands for strict constructors]` a bit
(now `Note [Unboxing evaluated arguments]`). In particular, we now take care of
it in `finaliseBoxity` (which is only called from demand analaysis) instead of
`wantToUnboxArg`.

I also had to resurrect `Note [Product demands for function body]` and rename
it to `Note [Unboxed demand on function bodies returning small products]` to
avoid huge regressions in `join004` and `join007`, thereby fixing #4267 again.
See the updated Note for details.

A nice side-effect is that the worker/wrapper transformation no longer needs to
look at strictness info and other bits such as `InsideInlineableFun` flags
(needed for `Note [Do not unbox class dictionaries]`) at all. It simply collects
boxity info from argument demands and interprets them with a severely simplified
`wantToUnboxArg`. All the smartness is in `finaliseBoxity`, which could be moved
to DmdAnal completely, if it wasn't for the call to `dubiousDataConInstArgTys`
which would be awkward to export.

I spent some time figuring out the reason for why `T16197` failed prior to my
amendments to `Note [Unboxing evaluated arguments]`. After having it figured
out, I minimised it a bit and added `T16197b`, which simply compares computed
strictness signatures and thus should be far simpler to eyeball.

The 12% ghc/alloc regression in T11545 is because of the additional `Boxity`
field in `Poly` and `Prod` that results in more allocation during `lubSubDmd`
and `plusSubDmd`. I made sure in the ticky profiles that the number of calls
to those functions stayed the same. We can bear such an increase here, as we
recently improved it by -68% (in b760c1f7).
T18698* regress slightly because there is more unboxing of dictionaries
happening and that causes Lint (mostly) to allocate more.

Fixes #19871, #19407, #4267, #16859, #18907 and #13331.

Metric Increase:
    T11545
    T18698a
    T18698b

Metric Decrease:
    T12425
    T16577
    T18223
    T18282
    T4267
    T9961

3bab222c

Sep 28, 2021
- Add `-dsuppress-core-sizes` flag (#20342) · 45a674aa
  Sylvain Henry authored 3 years ago and Marge Bot committed 3 years ago
  
  This flag is used to remove the output of core stats per binding in Core dumps.
  45a674aa
Sep 08, 2021
- Only dump Core stats when requested to do so (#20342) · 87d93745
  Sylvain Henry authored 3 years ago and Marge Bot committed 3 years ago
  
  87d93745
Jun 27, 2021

WorkWrap: Remove mkWWargs (#19874) · eee498bf

Sebastian Graf authored 3 years ago and

Marge Bot committed 3 years ago

`mkWWargs`'s job was pushing casts inwards and doing eta expansion to match
the arity with the number of argument demands we w/w for.

Nowadays, we use the Simplifier to eta expand to arity. In fact, in recent years
we have even seen the eta expansion done by w/w as harmful, see Note [Don't eta
expand in w/w]. If a function hasn't enough manifest lambdas, don't w/w it!

What purpose does `mkWWargs` serve in this world? Not a great one, it turns out!
I could remove it by pulling some important bits,
notably Note [Freshen WW arguments] and Note [Join points and beta-redexes].
Result: We reuse the freshened binder names of the wrapper in the
worker where possible (see testuite changes), much nicer!

In order to avoid scoping errors due to lambda-bound unfoldings in worker
arguments, we zap those unfoldings now. In doing so, we fix #19766.

Fixes #19874.

eee498bf

Apr 20, 2021

Worker/wrapper: Refactor CPR WW to work for nested CPR (#18174) · fdbead70

Sebastian Graf authored 4 years ago

In another small step towards bringing a manageable variant of Nested
CPR into GHC, this patch refactors worker/wrapper to be able to exploit
Nested CPR signatures. See the new Note [Worker/wrapper for CPR].

The nested code path is currently not triggered, though, because all
signatures that we annotate are still flat. So purely a refactoring.
I am very confident that it works, because I ripped it off !1866 95%
unchanged.

A few test case outputs changed, but only it's auxiliary names only.
I also added test cases for #18109 and #18401.

There's a 2.6% metric increase in T13056 after a rebase, caused by an
additional Simplifier run. It appears b1d0b9c saw a similar additional
iteration. I think it's just a fluke.

Metric Increase:
    T13056

fdbead70

Mar 20, 2021

Nested CPR light (#19398) · 044e5be3

Sebastian Graf authored 4 years ago and

Marge Bot committed 4 years ago

While fixing #19232, it became increasingly clear that the vestigial
hack described in `Note [Optimistic field binder CPR]` is complicated
and causes reboxing. Rather than make the hack worse, this patch
gets rid of it completely in favor of giving deeply unboxed parameters
the Nested CPR property. Example:
```hs
f :: (Int, Int) -> Int
f p = case p of
 (x, y) | x == y    = x
        | otherwise = y
```
Based on `p`'s `idDemandInfo` `1P(1P(L),1P(L))`, we can see that both
fields of `p` will be available unboxed. As a result, we give `p` the
nested CPR property `1(1,1)`. When analysing the `case`, the field
CPRs are transferred to the binders `x` and `y`, respectively, so that
we ultimately give `f` the CPR property.

I took the liberty to do a bit of refactoring:

- I renamed `CprResult` ("Constructed product result result") to plain
  `Cpr`.
- I Introduced `FlatConCpr` in addition to (now nested) `ConCpr` and
  and according pattern synonym that rewrites flat `ConCpr` to
  `FlatConCpr`s, purely for compiler perf reasons.
- Similarly for performance reasons, we now store binders with a
  Top signature in a separate `IntSet`,
  see `Note [Efficient Top sigs in SigEnv]`.
- I moved a bit of stuff around in `GHC.Core.Opt.WorkWrap.Utils` and
  introduced `UnboxingDecision` to replace the `Maybe DataConPatContext`
  type we used to return from `wantToUnbox`.
- Since the `Outputable Cpr` instance changed anyway, I removed the
  leading `m` which we used to emit for `ConCpr`. It's just noise,
  especially now that we may output nested CPRs.

Fixes #19398.

044e5be3

Mar 03, 2021

DmdAnal: Better syntax for demand signatures (#19016) · 3630b9ba

Sebastian Graf authored 4 years ago and

Marge Bot committed 4 years ago

The update of the Outputable instance resulted in a slew of
documentation changes within Notes that used the old syntax.
The most important doc changes are to `Note [Demand notation]`
and the user's guide.

Fixes #19016.

3630b9ba

Dec 14, 2020

Optimise nullary type constructor usage · dad87210

Ben Gamari authored 5 years ago

During the compilation of programs GHC very frequently deals with
the `Type` type, which is a synonym of `TYPE 'LiftedRep`. This patch
teaches GHC to avoid expanding the `Type` synonym (and other nullary
type synonyms) during type comparisons, saving a good amount of work.
This optimisation is described in `Note [Comparing nullary type
synonyms]`.

To maximize the impact of this optimisation, we introduce a few
special-cases to reduce `TYPE 'LiftedRep` to `Type`. See
`Note [Prefer Type over TYPE 'LiftedPtrRep]`.

Closes #17958.

Metric Decrease:
   T18698b
   T1969
   T12227
   T12545
   T12707
   T14683
   T3064
   T5631
   T5642
   T9020
   T9630
   T9872a
   T13035
   haddock.Cabal
   haddock.base

dad87210

Revert "Optimise nullary type constructor usage" · 92377c27
Ben Gamari authored 4 years ago
```
This was inadvertently merged.

This reverts commit 7e9debd4.
```
92377c27

Optimise nullary type constructor usage · 7e9debd4

Ben Gamari authored 5 years ago

During the compilation of programs GHC very frequently deals with
the `Type` type, which is a synonym of `TYPE 'LiftedRep`. This patch
teaches GHC to avoid expanding the `Type` synonym (and other nullary
type synonyms) during type comparisons, saving a good amount of work.
This optimisation is described in `Note [Comparing nullary type
synonyms]`.

To maximize the impact of this optimisation, we introduce a few
special-cases to reduce `TYPE 'LiftedRep` to `Type`. See
`Note [Prefer Type over TYPE 'LiftedPtrRep]`.

Closes #17958.

Metric Decrease:
   T18698b
   T1969
   T12227
   T12545
   T12707
   T14683
   T3064
   T5631
   T5642
   T9020
   T9630
   T9872a
   T13035
   haddock.Cabal
   haddock.base

7e9debd4

Nov 20, 2020

Demand: Interleave usage and strictness demands (#18903) · 0aec78b6

Sebastian Graf authored 4 years ago and

Marge Bot committed 4 years ago

As outlined in #18903, interleaving usage and strictness demands not
only means a more compact demand representation, but also allows us to
express demands that we weren't easily able to express before.

Call demands are *relative* in the sense that a call demand `Cn(cd)`
on `g` says "`g` is called `n` times. *Whenever `g` is called*, the
result is used according to `cd`". Example from #18903:

```hs
h :: Int -> Int
h m =
  let g :: Int -> (Int,Int)
      g 1 = (m, 0)
      g n = (2 * n, 2 `div` n)
      {-# NOINLINE g #-}
  in case m of
    1 -> 0
    2 -> snd (g m)
    _ -> uncurry (+) (g m)
```

Without the interleaved representation, we would just get `L` for the
strictness demand on `g`. Now we are able to express that whenever
`g` is called, its second component is used strictly in denoting `g`
by `1C1(P(1P(U),SP(U)))`. This would allow Nested CPR to unbox the
division, for example.

Fixes #18903.
While fixing regressions, I also discovered and fixed #18957.

Metric Decrease:
    T13253-spj

0aec78b6

Oct 14, 2020

Fix some missed opportunities for preInlineUnconditionally · 15d2340c

Simon Peyton Jones authored 4 years ago and

Marge Bot committed 4 years ago

There are two signficant changes here:

* Ticket #18815 showed that we were missing some opportunities for
  preInlineUnconditionally.  The one-line fix is in the code for
  GHC.Core.Opt.Simplify.Utils.preInlineUnconditionally, which now
  switches off only for INLINE pragmas.  I expanded
  Note [Stable unfoldings and preInlineUnconditionally] to explain.

* When doing this I discovered a way in which preInlineUnconditionally
  was occasionally /too/ eager.  It's all explained in
  Note [Occurrences in stable unfoldings] in GHC.Core.Opt.OccurAnal,
  and the one-line change adding markAllMany to occAnalUnfolding.

I also got confused about what NoUserInline meant, so I've renamed
it to NoUserInlinePrag, and changed its pretty-printing slightly.
That led to soem error messate wibbling, and touches quite a few
files, but there is no change in functionality.

I did a nofib run.  As expected, no significant changes.

        Program           Size    Allocs
----------------------------------------
         sphere          -0.0%     -0.4%
----------------------------------------
            Min          -0.0%     -0.4%
            Max          -0.0%     +0.0%
 Geometric Mean          -0.0%     -0.0%

I'm allowing a max-residency increase for T10370, which seems
very irreproducible. (See comments on !4241.)  There is always
sampling error for max-residency measurements; and in any case
the change shows up on some platforms but not others.

Metric Increase:
    T10370

15d2340c

Jul 28, 2020

This patch addresses the exponential blow-up in the simplifier. · 0bd60059

Simon Peyton Jones authored 4 years ago and

Marge Bot committed 4 years ago

Specifically:
  #13253 exponential inlining
  #10421 ditto
  #18140 strict constructors
  #18282 another nested-function call case

This patch makes one really significant changes: change the way that
mkDupableCont handles StrictArg.  The details are explained in
GHC.Core.Opt.Simplify Note [Duplicating StrictArg].

Specific changes

* In mkDupableCont, when making auxiliary bindings for the other arguments
  of a call, add extra plumbing so that we don't forget the demand on them.
  Otherwise we haev to wait for another round of strictness analysis. But
  actually all the info is to hand.  This change affects:
  - Make the strictness list in ArgInfo be [Demand] instead of [Bool],
    and rename it to ai_dmds.
  - Add as_dmd to ValArg
  - Simplify.makeTrivial takes a Demand
  - mkDupableContWithDmds takes a [Demand]

There are a number of other small changes

1. For Ids that are used at most once in each branch of a case, make
   the occurrence analyser record the total number of syntactic
   occurrences.  Previously we recorded just OneBranch or
   MultipleBranches.

   I thought this was going to be useful, but I ended up barely
   using it; see Note [Note [Suppress exponential blowup] in
   GHC.Core.Opt.Simplify.Utils

   Actual changes:
     * See the occ_n_br field of OneOcc.
     * postInlineUnconditionally

2. I found a small perf buglet in SetLevels; see the new
   function GHC.Core.Opt.SetLevels.hasFreeJoin

3. Remove the sc_cci field of StrictArg.  I found I could get
   its information from the sc_fun field instead.  Less to get
   wrong!

4. In ArgInfo, arrange that ai_dmds and ai_discs have a simpler
   invariant: they line up with the value arguments beyond ai_args
   This allowed a bit of nice refactoring; see isStrictArgInfo,
   lazyArgcontext, strictArgContext

There is virtually no difference in nofib. (The runtime numbers
are bogus -- I tried a few manually.)

        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
            fft          +0.0%     -2.0%    -48.3%    -49.4%      0.0%
     multiplier          +0.0%     -2.2%    -50.3%    -50.9%      0.0%
--------------------------------------------------------------------------------
            Min          -0.4%     -2.2%    -59.2%    -60.4%      0.0%
            Max          +0.0%     +0.1%     +3.3%     +4.9%      0.0%
 Geometric Mean          +0.0%     -0.0%    -33.2%    -34.3%     -0.0%

Test T18282 is an existing example of these deeply-nested strict calls.
We get a big decrease in compile time (-85%) because so much less
inlining takes place.

Metric Decrease:
    T18282

0bd60059

Jul 23, 2020
- Define type Void# = (# #) (#18441) · cfa89149
  Krzysztof Gogolewski authored 4 years ago and Marge Bot committed 4 years ago
  
  There's one backwards compatibility issue: GHC.Prim no longer exports Void#, we now manually re-export it from GHC.Exts.
  cfa89149
Jul 13, 2020

Reduce result discount in conSize · 7ccb760b

Simon Peyton Jones authored 4 years ago

Ticket #18282 showed that the result discount given by conSize
was massively too large.  This patch reduces that discount to
a constant 10, which just balances the cost of the constructor
application itself.

Note [Constructor size and result discount] elaborates, as
does the ticket #18282.

Reducing result discount reduces inlining, which affects perf.  I
found that I could increase the unfoldingUseThrehold from 80 to 90 in
compensation; in combination with the result discount change I get
these overall nofib numbers:

        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
          boyer          -0.2%     +5.4%     -3.2%     -3.4%      0.0%
       cichelli          -0.1%     +5.9%    -11.2%    -11.7%      0.0%
      compress2          -0.2%     +9.6%     -6.0%     -6.8%      0.0%
   cryptarithm2          -0.1%     -3.9%     -6.0%     -5.7%      0.0%
         gamteb          -0.2%     +2.6%    -13.8%    -14.4%      0.0%
         genfft          -0.1%     -1.6%    -29.5%    -29.9%      0.0%
             gg          -0.0%     -2.2%    -17.2%    -17.8%    -20.0%
           life          -0.1%     -2.2%    -62.3%    -63.4%      0.0%
           mate          +0.0%     +1.4%     -5.1%     -5.1%    -14.3%
         parser          -0.2%     -2.1%     +7.4%     +6.7%      0.0%
      primetest          -0.2%    -12.8%    -14.3%    -14.2%      0.0%
         puzzle          -0.2%     +2.1%    -10.0%    -10.4%      0.0%
            rsa          -0.2%    -11.7%     -3.7%     -3.8%      0.0%
         simple          -0.2%     +2.8%    -36.7%    -38.3%     -2.2%
   wheel-sieve2          -0.1%    -19.2%    -48.8%    -49.2%    -42.9%
--------------------------------------------------------------------------------
            Min          -0.4%    -19.2%    -62.3%    -63.4%    -42.9%
            Max          +0.3%     +9.6%     +7.4%    +11.0%    +16.7%
 Geometric Mean          -0.1%     -0.3%    -17.6%    -18.0%     -0.7%

I'm ok with these numbers, remembering that this change removes
an *exponential* increase in code size in some in-the-wild cases.

I investigated compress2.  The difference is entirely caused by this
function no longer inlining

WriteRoutines.$woutputCodes
  = \ (w :: [CodeEvent]) ->
      let result_s1Sr
            = case WriteRoutines.outputCodes_$s$woutput w 0# 0# 8# 9# of
                (# ww1, ww2 #) -> (ww1, ww2)
      in (# case result_s1Sr of (x, _) ->
              map @Int @Char WriteRoutines.outputCodes1 x
         , case result_s1Sr of { (_, y) -> y } #)

It was right on the cusp before, driven by the excessive result
discount.  Too bad!

Happily, the compiler/perf tests show a number of improvements:
    T12227     compiler bytes-alloc  -6.6%
    T12545     compiler bytes-alloc  -4.7%
    T13056     compiler bytes-alloc  -3.3%
    T15263     runtime  bytes-alloc -13.1%
    T17499     runtime  bytes-alloc -14.3%
    T3294      compiler bytes-alloc  -1.1%
    T5030      compiler bytes-alloc -11.7%
    T9872a     compiler bytes-alloc  -2.0%
    T9872b     compiler bytes-alloc  -1.2%
    T9872c     compiler bytes-alloc  -1.5%

Metric Decrease:
    T12227
    T12545
    T13056
    T15263
    T17499
    T3294
    T5030
    T9872a
    T9872b
    T9872c

7ccb760b

Jun 10, 2020

Implement cast worker/wrapper properly · 6d49d5be

Simon Peyton Jones authored 4 years ago and

Marge Bot committed 4 years ago

The cast worker/wrapper transformation transforms
   x = e |> co
into
   y = e
   x = y |> co

This is done by the simplifier, but we were being
careless about transferring IdInfo from x to y,
and about what to do if x is a NOINLNE function.
This resulted in a series of bugs:
     #17673, #18093, #18078.

This patch fixes all that:

* Main change is in GHC.Core.Opt.Simplify, and
  the new prepareBinding function, which does this
  cast worker/wrapper transform.
  See Note [Cast worker/wrappers].

* There is quite a bit of refactoring around
  prepareRhs, makeTrivial etc.  It's nicer now.

* Some wrappers from strictness and cast w/w, notably those for
  a function with a NOINLINE, should inline very late. There
  wasn't really a mechanism for that, which was an existing bug
  really; so I invented a new finalPhase = Phase (-1).  It's used
  for all simplifier runs after the user-visible phase 2,1,0 have
  run.  (No new runs of the simplifier are introduced thereby.)

  See new Note [Compiler phases] in GHC.Types.Basic;
  the main changes are in GHC.Core.Opt.Driver

* Doing this made me trip over two places where the AnonArgFlag on a
  FunTy was being lost so we could end up with (Num a -> ty)
  rather than (Num a => ty)
    - In coercionLKind/coercionRKind
    - In contHoleType in the Simplifier

  I fixed the former by defining mkFunctionType and using it in
  coercionLKind/RKind.

  I could have done the same for the latter, but the information
  is almost to hand.  So I fixed the latter by
    - adding sc_hole_ty to ApplyToVal (like ApplyToTy),
    - adding as_hole_ty to ValArg (like TyArg)
    - adding sc_fun_ty to StrictArg
  Turned out I could then remove ai_type from ArgInfo.  This is
  just moving the deck chairs around, but it worked out nicely.

  See the new Note [AnonArgFlag] in GHC.Types.Var

* When looking at the 'arity decrease' thing (#18093) I discovered
  that stable unfoldings had a much lower arity than the actual
  optimised function.  That's what led to the arity-decrease
  message.  Simple solution: eta-expand.

  It's described in Note [Eta-expand stable unfoldings]
  in GHC.Core.Opt.Simplify

* I also discovered that unsafeCoerce wasn't being inlined if
  the context was boring.  So (\x. f (unsafeCoerce x)) would
  create a thunk -- yikes!  I fixed that by making inlineBoringOK
  a bit cleverer: see Note [Inline unsafeCoerce] in GHC.Core.Unfold.

  I also found that unsafeCoerceName was unused, so I removed it.

I made a test case for #18078, and a very similar one for #17673.

The net effect of all this on nofib is very modest, but positive:

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
           anna          -0.4%     -0.1%     -3.1%     -3.1%      0.0%
 fannkuch-redux          -0.4%     -0.3%     -0.1%     -0.1%      0.0%
       maillist          -0.4%     -0.1%     -7.8%     -1.0%    -14.3%
      primetest          -0.4%    -15.6%     -7.1%     -6.6%      0.0%
--------------------------------------------------------------------------------
            Min          -0.9%    -15.6%    -13.3%    -14.2%    -14.3%
            Max          -0.3%      0.0%    +12.1%    +12.4%      0.0%
 Geometric Mean          -0.4%     -0.2%     -2.3%     -2.2%     -0.1%

All following metric decreases are compile-time allocation decreases
between -1% and -3%:

Metric Decrease:
  T5631
  T13701
  T14697
  T15164

6d49d5be

May 13, 2020

CprAnal: Don't attach CPR sigs to expandable bindings (#18154) · 86d8ac22

Sebastian Graf authored 4 years ago and

Marge Bot committed 4 years ago

Instead, look through expandable unfoldings in `cprTransform`.
See the new Note [CPR for expandable unfoldings]:

```
Long static data structures (whether top-level or not) like

  xs = x1 : xs1
  xs1 = x2 : xs2
  xs2 = x3 : xs3

should not get CPR signatures, because they

  * Never get WW'd, so their CPR signature should be irrelevant after analysis
    (in fact the signature might even be harmful for that reason)
  * Would need to be inlined/expanded to see their constructed product
  * Recording CPR on them blows up interface file sizes and is redundant with
    their unfolding. In case of Nested CPR, this blow-up can be quadratic!

But we can't just stop giving DataCon application bindings the CPR property,
for example

  fac 0 = 1
  fac n = n * fac (n-1)

fac certainly has the CPR property and should be WW'd! But FloatOut will
transform the first clause to

  lvl = 1
  fac 0 = lvl

If lvl doesn't have the CPR property, fac won't either. But lvl doesn't have a
CPR signature to extrapolate into a CPR transformer ('cprTransform'). So
instead we keep on cprAnal'ing through *expandable* unfoldings for these arity
0 bindings via 'cprExpandUnfolding_maybe'.

In practice, GHC generates a lot of (nested) TyCon and KindRep bindings, one
for each data declaration. It's wasteful to attach CPR signatures to each of
them (and intractable in case of Nested CPR).
```

Fixes #18154.

86d8ac22

Feb 12, 2020

Always display inferred variables using braces · df084681

Krzysztof Gogolewski authored 5 years ago

We now always show "forall {a}. T" for inferred variables,
previously this was controlled by -fprint-explicit-foralls.

This implements part 1 of https://github.com/ghc-proposals/ghc-proposals/pull/179.

Part of GHC ticket #16320.

Furthermore, when printing a levity restriction error, we now display
the HsWrap of the expression. This lets users see the full elaboration with
-fprint-typechecker-elaboration (see also #17670)

df084681

Separate CPR analysis from the Demand analyser · 059c3c9d

Sebastian Graf authored 6 years ago

The reasons for that can be found in the wiki:
https://gitlab.haskell.org/ghc/ghc/wikis/nested-cpr/split-off-cpr

We now run CPR after demand analysis (except for after the final demand
analysis run just before code gen). CPR got its own dump flags
(`-ddump-cpr-anal`, `-ddump-cpr-signatures`), but not its own flag to
activate/deactivate. It will run with `-fstrictness`/`-fworker-wrapper`.

As explained on the wiki page, this step is necessary for a sane Nested
CPR analysis. And it has quite positive impact on compiler performance:

Metric Decrease:
    T9233
    T9675
    T9961
    T15263

059c3c9d

Jan 31, 2020

Do CafInfo/SRT analysis in Cmm · c846618a

Ömer Sinan Ağacan authored 5 years ago

This patch removes all CafInfo predictions and various hacks to preserve
predicted CafInfos from the compiler and assigns final CafInfos to
interface Ids after code generation. SRT analysis is extended to support
static data, and Cmm generator is modified to allow generating
static_link fields after SRT analysis.

This also fixes `-fcatch-bottoms`, which introduces error calls in case
expressions in CorePrep, which runs *after* CoreTidy (which is where we
decide on CafInfos) and turns previously non-CAFFY things into CAFFY.

Fixes #17648
Fixes #9718

Evaluation
==========

NoFib
-----

Boot with: `make boot mode=fast`
Run: `make mode=fast EXTRA_RUNTEST_OPTS="-cachegrind" NoFibRuns=1`

--------------------------------------------------------------------------------
        Program           Size    Allocs    Instrs     Reads    Writes
--------------------------------------------------------------------------------
             CS          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
            CSD          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
             FS          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
              S          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
             VS          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
            VSD          -0.0%      0.0%     -0.0%     -0.0%     -0.5%
            VSM          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           anna          -0.1%      0.0%     -0.0%     -0.0%     -0.0%
           ansi          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           atom          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         awards          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         banner          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
     bernouilli          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
   binary-trees          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
          boyer          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         boyer2          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           bspt          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
      cacheprof          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
       calendar          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
       cichelli          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
        circsim          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
       clausify          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
  comp_lab_zift          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
       compress          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
      compress2          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
    constraints          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
   cryptarithm1          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
   cryptarithm2          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
            cse          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
   digits-of-e1          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
   digits-of-e2          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         dom-lt          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
          eliza          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
          event          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
    exact-reals          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         exp3_8          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         expert          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
 fannkuch-redux          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
          fasta          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
            fem          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
            fft          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           fft2          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
       fibheaps          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           fish          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
          fluid          -0.1%      0.0%     -0.0%     -0.0%     -0.0%
         fulsom          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         gamteb          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
            gcd          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
    gen_regexps          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         genfft          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
             gg          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           grep          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         hidden          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
            hpg          -0.1%      0.0%     -0.0%     -0.0%     -0.0%
            ida          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
          infer          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
        integer          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
      integrate          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
   k-nucleotide          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
          kahan          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
        knights          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         lambda          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
     last-piece          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           lcss          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           life          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           lift          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         linear          -0.1%      0.0%     -0.0%     -0.0%     -0.0%
      listcompr          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
       listcopy          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
       maillist          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         mandel          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
        mandel2          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           mate          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
        minimax          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
        mkhprog          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
     multiplier          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         n-body          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
       nucleic2          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           para          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
      paraffins          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         parser          -0.1%      0.0%     -0.0%     -0.0%     -0.0%
        parstof          -0.1%      0.0%     -0.0%     -0.0%     -0.0%
            pic          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
       pidigits          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
          power          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         pretty          -0.0%      0.0%     -0.3%     -0.4%     -0.4%
         primes          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
      primetest          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         prolog          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         puzzle          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         queens          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
        reptile          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
reverse-complem          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
        rewrite          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           rfib          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
            rsa          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
            scc          -0.0%      0.0%     -0.3%     -0.5%     -0.4%
          sched          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
            scs          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         simple          -0.1%      0.0%     -0.0%     -0.0%     -0.0%
          solid          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
        sorting          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
  spectral-norm          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         sphere          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
         symalg          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
            tak          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
      transform          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
       treejoin          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
      typecheck          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
        veritas          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           wang          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
      wave4main          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
   wheel-sieve1          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
   wheel-sieve2          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           x2n1          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
--------------------------------------------------------------------------------
            Min          -0.1%      0.0%     -0.3%     -0.5%     -0.5%
            Max          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
 Geometric Mean          -0.0%     -0.0%     -0.0%     -0.0%     -0.0%

--------------------------------------------------------------------------------
        Program           Size    Allocs    Instrs     Reads    Writes
--------------------------------------------------------------------------------
        circsim          -0.1%      0.0%     -0.0%     -0.0%     -0.0%
    constraints          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
       fibheaps          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
       gc_bench          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           hash          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
           lcss          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
          power          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
     spellcheck          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
--------------------------------------------------------------------------------
            Min          -0.1%      0.0%     -0.0%     -0.0%     -0.0%
            Max          -0.0%      0.0%     -0.0%     -0.0%     -0.0%
 Geometric Mean          -0.0%     +0.0%     -0.0%     -0.0%     -0.0%

Manual inspection of programs in testsuite/tests/programs
---------------------------------------------------------

I built these programs with a bunch of dump flags and `-O` and compared
STG, Cmm, and Asm dumps and file sizes.

(Below the numbers in parenthesis show number of modules in the program)

These programs have identical compiler (same .hi and .o sizes, STG, and
Cmm and Asm dumps):

- Queens (1), andre_monad (1), cholewo-eval (2), cvh_unboxing (3),
  andy_cherry (7), fun_insts (1), hs-boot (4), fast2haskell (2),
  jl_defaults (1), jq_readsPrec (1), jules_xref (1), jtod_circint (4),
  jules_xref2 (1), lennart_range (1), lex (1), life_space_leak (1),
  bargon-mangler-bug (7), record_upd (1), rittri (1), sanders_array (1),
  strict_anns (1), thurston-module-arith (2), okeefe_neural (1),
  joao-circular (6), 10queens (1)

Programs with different compiler outputs:

- jl_defaults (1): For some reason GHC HEAD marks a lot of top-level
  `[Int]` closures as CAFFY for no reason. With this patch we no longer
  make them CAFFY and generate less SRT entries. For some reason Main.o
  is slightly larger with this patch (1.3%) and the executable sizes are
  the same. (I'd expect both to be smaller)

- launchbury (1): Same as jl_defaults: top-level `[Int]` closures marked
  as CAFFY for no reason. Similarly `Main.o` is 1.4% larger but the
  executable sizes are the same.

- galois_raytrace (13): Differences are in the Parse module. There are a
  lot, but some of the changes are caused by the fact that for some
  reason (I think a bug) GHC HEAD marks the dictionary for `Functor
  Identity` as CAFFY. Parse.o is 0.4% larger, the executable size is the
  same.

- north_array: We now generate less SRT entries because some of array
  primops used in this program like `NewArrayOp` get eliminated during
  Stg-to-Cmm and turn some CAFFY things into non-CAFFY. Main.o gets 24%
  larger (9224 bytes from 9000 bytes), executable sizes are the same.

- seward-space-leak: Difference in this program is better shown by this
  smaller example:

      module Lib where

      data CDS
        = Case [CDS] [(Int, CDS)]
        | Call CDS CDS

      instance Eq CDS where
        Case sels1 rets1 == Case sels2 rets2 =
            sels1 == sels2 && rets1 == rets2
        Call a1 b1 == Call a2 b2 =
            a1 == a2 && b1 == b2
        _ == _ =
            False

   In this program GHC HEAD builds a new SRT for the recursive group of
   `(==)`, `(/=)` and the dictionary closure. Then `/=` points to `==`
   in its SRT field, and `==` uses the SRT object as its SRT. With this
   patch we use the closure for `/=` as the SRT and add `==` there. Then
   `/=` gets an empty SRT field and `==` points to `/=` in its SRT
   field.

   This change looks fine to me.

   Main.o gets 0.07% larger, executable sizes are identical.

head.hackage
------------

head.hackage's CI script builds 428 packages from Hackage using this
patch with no failures.

Compiler performance
--------------------

The compiler perf tests report that the compiler allocates slightly more
(worst case observed so far is 4%). However most programs in the test
suite are small, single file programs. To benchmark compiler performance
on something more realistic I build Cabal (the library, 236 modules)
with different optimisation levels. For the "max residency" row I run
GHC with `+RTS -s -A100k -i0 -h` for more accurate numbers. Other rows
are generated with just `-s`. (This is because `-i0` causes running GC
much more frequently and as a result "bytes copied" gets inflated by
more than 25x in some cases)

* -O0

|                 | GHC HEAD       | This MR        | Diff   |
| --------------- | -------------- | -------------- | ------ |
| Bytes allocated | 54,413,350,872 | 54,701,099,464 | +0.52% |
| Bytes copied    |  4,926,037,184 |  4,990,638,760 | +1.31% |
| Max residency   |    421,225,624 |    424,324,264 | +0.73% |

* -O1

|                 | GHC HEAD        | This MR         | Diff   |
| --------------- | --------------- | --------------- | ------ |
| Bytes allocated | 245,849,209,992 | 246,562,088,672 | +0.28% |
| Bytes copied    |  26,943,452,560 |  27,089,972,296 | +0.54% |
| Max residency   |     982,643,440 |     991,663,432 | +0.91% |

* -O2

|                 | GHC HEAD        | This MR         | Diff   |
| --------------- | --------------- | --------------- | ------ |
| Bytes allocated | 291,044,511,408 | 291,863,910,912 | +0.28% |
| Bytes copied    |  37,044,237,616 |  36,121,690,472 | -2.49% |
| Max residency   |   1,071,600,328 |   1,086,396,256 | +1.38% |

Extra compiler allocations
--------------------------

Runtime allocations of programs are as reported above (NoFib section).

The compiler now allocates more than before. Main source of allocation
in this patch compared to base commit is the new SRT algorithm
(GHC.Cmm.Info.Build). Below is some of the extra work we do with this
patch, numbers generated by profiled stage 2 compiler when building a
pathological case (the test 'ManyConstructors') with '-O2':

- We now sort the final STG for a module, which means traversing the
  entire program, generating free variable set for each top-level
  binding, doing SCC analysis, and re-ordering the program. In
  ManyConstructors this step allocates 97,889,952 bytes.

- We now do SRT analysis on static data, which in a program like
  ManyConstructors causes analysing 10,000 bindings that we would
  previously just skip. This step allocates 70,898,352 bytes.

- We now maintain an SRT map for the entire module as we compile Cmm
  groups:

      data ModuleSRTInfo = ModuleSRTInfo
        { ...
        , moduleSRTMap :: SRTMap
        }

   (SRTMap is just a strict Map from the 'containers' library)

   This map gets an entry for most bindings in a module (exceptions are
   THUNKs and CAFFY static functions). For ManyConstructors this map
   gets 50015 entries.

- Once we're done with code generation we generate a NameSet from SRTMap
  for the non-CAFFY names in the current module. This set gets the same
  number of entries as the SRTMap.

- Finally we update CafInfos in ModDetails for the non-CAFFY Ids, using
  the NameSet generated in the previous step. This usually does the
  least amount of allocation among the work listed here.

Only place with this patch where we do less work in the CAF analysis in
the tidying pass (CoreTidy). However that doesn't save us much, as the
pass still needs to traverse the whole program and update IdInfos for
other reasons. Only thing we don't here do is the `hasCafRefs` pass over
the RHS of bindings, which is a stateless pass that returns a boolean
value, so it doesn't allocate much.

(Metric changes blow are all increased allocations)

Metric changes
--------------

Metric Increase:
    ManyAlternatives
    ManyConstructors
    T13035
    T14683
    T1969
    T9961

c846618a

Jan 08, 2020

Print Core type applications with no whitespace after @ (#17643 ) · 923a1272

Ryan Scott authored 5 years ago and

Marge Bot committed 5 years ago

This brings the pretty-printer for Core in line with how visible
type applications are normally printed: namely, with no whitespace
after the `@` character (i.e., `f @a` instead of `f @ a`). While I'm
in town, I also give the same treatment to type abstractions (i.e.,
`\(@a)` instead of `\(@ a)`) and coercion applications (i.e.,
`f @~x` instead of `f @~ x`).

Fixes #17643.

923a1272

Dec 12, 2018

Improvements to demand analysis · d77501cd

Simon Peyton Jones authored 6 years ago

This patch collects a few improvements triggered by Trac #15696,
and fixing Trac #16029

* Stop making toCleanDmd behave specially for unlifted types.
  This special case was the cause of stupid behaviour in Trac
  #16029.  And to my joy I discovered the let/app invariant
  rendered it unnecessary.  (Maybe the special case pre-dated
  the let/app invariant.)

  Result: less special-case handling in the compiler, and
  better perf for the compiled code.

* In WwLib.mkWWstr_one, treat seqDmd like U(AAA).  It was not
  being so treated before, which again led to stupid code.

* Update and improve Notes

There are .stderr test wibbles because we get slightly different
strictness signatures for an argumment of unlifted type:
    <L,U> rather than <S,U>        for Int#
    <S,U> rather than <S(S),U(U)>  for Int

d77501cd

Jun 07, 2018

Remove ad-hoc special case in occAnal · c16382d5

Simon Peyton Jones authored 6 years ago

Back in 1999 I put this ad-hoc code in the Case-handling
code for occAnal:

  occAnal env (Case scrut bndr ty alts)
   = ...
        -- Note [Case binder usage]
        -- ~~~~~~~~~~~~~~~~~~~~~~~~
        -- The case binder gets a usage of either "many" or "dead", never "one".
        -- Reason: we like to inline single occurrences, to eliminate a binding,
        -- but inlining a case binder *doesn't* eliminate a binding.
        -- We *don't* want to transform
        --      case x of w { (p,q) -> f w }
        -- into
        --      case x of w { (p,q) -> f (p,q) }
    tag_case_bndr usage bndr
      = (usage', setIdOccInfo bndr final_occ_info)
      where
        occ_info       = lookupDetails usage bndr
        usage'         = usage `delDetails` bndr
        final_occ_info = case occ_info of IAmDead -> IAmDead
                                          _       -> noOccInfo

But the comment looks wrong -- the bad inlining will not happen -- and
I think it relates to some long-ago version of the simplifier.

So I simply removed the special case, which gives more accurate
occurrence-info to the case binder.  Interestingly I got a slight
improvement in nofib binary sizes.

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
      cacheprof          -0.1%     +0.2%     -0.7%     -1.2%     +8.6%
--------------------------------------------------------------------------------
            Min          -0.2%      0.0%    -14.5%    -30.5%      0.0%
            Max          -0.1%     +0.2%    +10.0%    +10.0%    +25.0%
 Geometric Mean          -0.2%     +0.0%     -1.9%     -5.4%     +0.3%

I have no idea if the improvement in runtime is real.  I did look at the
tiny increase in allocation for cacheprof and concluded that it was
unimportant (I forget the details).

Also the more accurate occ-info for the case binder meant that some
inlining happens in one pass that previously took successive passes
for the test dependent/should_compile/dynamic-paper (which has a
known Russel-paradox infinite loop in the simplifier).

In short, a small win: less ad-hoc complexity and slightly smaller
binaries.

c16382d5

Apr 20, 2018

Inline wrappers earlier · 8b10b896

Simon Peyton Jones authored 6 years ago

This patch has a single significant change:

  strictness wrapper functions are inlined earlier,
  in phase 2 rather than phase 0.

As shown by Trac #15056, this gives a better chance for RULEs to fire.
Before this change, a function that would have inlined early without
strictness analyss was instead inlining late. Result: applying
"optimisation" made the program worse.

This does not make too much difference in nofib, but I've stumbled
over the problem more than once, so even a "no-change" result would be
quite acceptable.  Here are the headlines:

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
      cacheprof          -0.5%     -0.5%     +2.5%     +2.5%      0.0%
         fulsom          -1.0%     +2.6%     -0.1%     -0.1%      0.0%
           mate          -0.6%     +2.4%     -0.9%     -0.9%      0.0%
        veritas          -0.7%    -23.2%     0.002     0.002      0.0%
--------------------------------------------------------------------------------
            Min          -1.4%    -23.2%    -12.5%    -15.3%      0.0%
            Max          +0.6%     +2.6%     +4.4%     +4.3%    +19.0%
 Geometric Mean          -0.7%     -0.2%     -1.4%     -1.7%     +0.2%

* A worthwhile reduction in binary size.

* Runtimes are not to be trusted much but look as if they
  are moving the right way.

* A really big win in veritas, described in comment:1 of
  Trac #15056; more fusion rules fired.

* I investigated the losses in 'mate' and 'fulsom'; see #15056.

8b10b896

Jan 03, 2018

Get evaluated-ness right in the back end · bd438b2d

Simon Peyton Jones authored 7 years ago

See Trac #14626, comment:4.  We want to maintain evaluted-ness
info on Ids into the code generateor for two reasons
(see Note [Preserve evaluated-ness in CorePrep] in CorePrep)

- DataToTag magic
- Potentially using it in the codegen (this is Gabor's
  current work)

But it was all being done very inconsistently, and actually
outright wrong -- the DataToTag magic hasn't been working for
years.

This patch tidies it all up, with Notes to match.

bd438b2d

Sep 12, 2017

Allow CSE'ing of work-wrapped bindings (#14186 ) · fe04f378

Joachim Breitner authored 7 years ago

the worker/wrapper creates an artificial INLINE pragma, which caused CSE
to not do its work. We now recognize such artificial pragmas by using
`NoUserInline` instead of `Inline` as the `InlineSpec`.

Differential Revision: https://phabricator.haskell.org/D3939

fe04f378

Mar 06, 2017

Make FloatOut/SetLevels idemoptent on bottoming functions · fb9ae288

Simon Peyton Jones authored 8 years ago

This fixes Trac #13369.  It turned out that I really had got the
bottoming-float code wrong, again.  The new story is explained in
Note [Bottoming floats], esp item (3), and Note [Floating from a RHS].

I didn't make a regression test; it's hard to to so.

Nofib result are good

--------------------------------------------------------------------------------
        Program           Size    Allocs   Runtime   Elapsed  TotalMem
--------------------------------------------------------------------------------
         banner          -2.2%     -4.6%      0.00      0.00     +0.0%
           bspt          -1.3%     -1.6%      0.01      0.01     +0.0%
      cacheprof          -1.8%     -0.3%     +3.7%     +3.7%     -0.9%
   digits-of-e2          -1.0%     -1.5%     -0.5%     -0.5%     +0.0%
         expert          -1.3%     -0.2%      0.00      0.00     +0.0%
         n-body          -1.1%     -0.2%     +0.1%     +0.1%     +0.0%
        veritas          -2.9%     -0.1%      0.00      0.00     +0.0%
--------------------------------------------------------------------------------
            Min          -2.9%     -4.6%     -7.4%     -7.4%    -19.8%
            Max          -1.0%     +0.0%     +5.2%     +5.1%    +10.0%
 Geometric Mean          -1.2%     -0.1%     +0.5%     +0.5%     -0.1%

I /think/ all this is due to this error-floating change; but it's possible
that some was due to commit "Fix CSE (again) on literal strings" a couple
of commits earlier.

fb9ae288

Mar 02, 2017

Fix expected result from T13143 · d4a6a7fe

David Feuer authored 8 years ago

Reviewers: austin, bgamari

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D3260

d4a6a7fe

Feb 06, 2017

Do Worker/Wrapper for NOINLINE things · b572aadb

Eric Seidel authored 8 years ago and

Ben Gamari committed 8 years ago

Disabling worker/wrapper for NOINLINE things can cause unnecessary
reboxing of values. Consider

    {-# NOINLINE f #-}
    f :: Int -> a
    f x = error (show x)

    g :: Bool -> Bool -> Int -> Int
    g True  True  p = f p
    g False True  p = p + 1
    g b     False p = g b True p

the strictness analysis will discover f and g are strict, but because f
has no wrapper, the worker for g will rebox p. So we get

    $wg x y p# =
      let p = I# p# in  -- Yikes! Reboxing!
      case x of
        False ->
          case y of
            False -> $wg False True p#
            True -> +# p# 1#
        True ->
          case y of
            False -> $wg True True p#
            True -> case f p of { }

    g x y p = case p of (I# p#) -> $wg x y p#

Now, in this case the reboxing will float into the True branch, an so
the allocation will only happen on the error path. But it won't float
inwards if there are multiple branches that call (f p), so the reboxing
will happen on every call of g. Disaster.

Solution: do worker/wrapper even on NOINLINE things; but move the
NOINLINE pragma to the worker.

Test Plan: make test TEST="13143"

Reviewers: simonpj, bgamari, dfeuer, austin

Reviewed By: simonpj, bgamari

Subscribers: dfeuer, thomie

Differential Revision: https://phabricator.haskell.org/D3046

b572aadb