last uses too much space with optimizations disabled

changed weight to 5

added Tbug Trac import core libraries labels

changed the description

What is the "new definition"? I'm pretty hazy about what exactly you are proposing here. Could you write it out explicitly and then the Core Libraries Committee can decide.

assigned to @nomeata and unassigned @trac-ekmett

When foldl started to be a good consumer, we implemented last using foldl, which is cool, but not helpful without optimization.
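To make the comparison concrete, here is a minimal sketch of the two styles of definition being discussed. The names `lastViaFoldl` and `lastDirect` are hypothetical stand-ins, not the GHC.List source, and plain `error` replaces the internal `errorEmptyList`.

```haskell
-- Hypothetical stand-ins for illustration; not the GHC.List source.

-- The foldl-based style: short, and able to fuse when the optimizer runs.
lastViaFoldl :: [a] -> a
lastViaFoldl = foldl (\_ x -> x) (error "lastViaFoldl: empty list")

-- The hand-written style: a plain loop that carries the most recent element.
lastDirect :: [a] -> a
lastDirect []     = error "lastDirect: empty list"
lastDirect (x:xs) = go x xs
  where
    go y []     = y
    go _ (z:zs) = go z zs
```

Both return the same element on non-empty input; the ticket is about how much space the compiled code needs when the calling program is built without -O.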

I’ll take care of this.

Put this up at D844 and will push it once it has validated.

Trac metadata

Trac field	Value
Differential revisions	→ D844

@rwbarton, just to make sure I don’t get you wrong: Are you talking about compiling a "normal" program using last, using a normally built compiler, or are you talking about compiling the base libraries without optimization, e.g. during GHC development?

I had a closer look, and the problem is the following:

When we added the implementation

last = foldl (\_ x -> x) (error "...")

to GHC.List in #9339 (closed), we assumed (and I thought we checked, but maybe that is not true) that GHC would optimize it so that the Core looks something like this:

last :: forall a_ask. [a_ask] -> a_ask
[GblId,
 Arity=1,
 Str=DmdType,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 30 60}
         Tmpl=\ (@ a_aTE) -> foldl @ a_aTE @ a_aTE (GHC.List.last2 @ a_aTE) (GHC.List.last1 @ a_aTE)]
last = .. efficient implementation derived from foldl ... ..

This way, when using last in code compiled without -O, the efficient variant would be called, while with -O the unfolding would be used and optimized on the spot, possibly fusing, and producing good code if Call Arity kicks in.

Unfortunately, this is what we see:

last :: forall a_ask. [a_ask] -> a_ask
[GblId,
 Arity=1,
 Str=DmdType,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=IF_ARGS [] 30 60}]
last =
  \ (@ a_aTE) ->
    foldl
      @ a_aTE @ a_aTE (GHC.List.last2 @ a_aTE) (GHC.List.last1 @ a_aTE)

Now the question is: Why is the code not optimized in GHC.List? My guess is that we should have written

last xs = foldl (\_ x -> x) (error "...") xs

And indeed, this way, we get:

Rec {
GHC.List.last2 [Occ=LoopBreaker]
  :: forall a_aTF. [a_aTF] -> a_aTF -> a_aTF
[GblId, Arity=2, Caf=NoCafRefs, Str=DmdType <S,1*U><L,1*U>]
GHC.List.last2 =
  \ (@ a_aTF) (ds_a1Es :: [a_aTF]) (eta_B1 :: a_aTF) ->
    case ds_a1Es of _ [Occ=Dead] {
      [] -> eta_B1;
      : y_a1Ex ys_a1Ey -> GHC.List.last2 @ a_aTF ys_a1Ey y_a1Ex
    }
end Rec }

last :: forall a_ask. [a_ask] -> a_ask
[GblId,
 Arity=1,
 Str=DmdType <S,1*U>,
 Unf=Unf{Src=<vanilla>, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=IF_ARGS [0] 30 0}]
last =
  \ (@ a_aTF) (xs_ast :: [a_aTF]) ->
    GHC.List.last2 @ a_aTF xs_ast (GHC.List.last1 @ a_aTF)

which is nice, but now we have lost the original definition.

Next try, also adding {-# INLINEABLE last #-}, which yields:

Rec {
poly_go_r2F3 :: forall a_aTF. [a_aTF] -> a_aTF -> a_aTF
[GblId, Arity=2, Caf=NoCafRefs, Str=DmdType <S,1*U><L,1*U>]
poly_go_r2F3 =
  \ (@ a_aTF) (ds_a1Es :: [a_aTF]) (eta_B1 :: a_aTF) ->
    case ds_a1Es of _ [Occ=Dead] {
      [] -> eta_B1;
      : y_a1Ex ys_a1Ey -> poly_go_r2F3 @ a_aTF ys_a1Ey y_a1Ex
    }
end Rec }

last [InlPrag=INLINABLE[ALWAYS]] :: forall a_ask. [a_ask] -> a_ask
[GblId,
 Arity=1,
 Str=DmdType <S,1*U>,
 Unf=Unf{Src=InlineStable, TopLvl=True, Value=True, ConLike=True,
         WorkFree=True, Expandable=True, Guidance=IF_ARGS [0] 160 0
         Tmpl= \ (@ a_aTF) (xs_ast [Occ=Once] :: [a_aTF]) ->
                 foldr
                   @ a_aTF
                   @ (a_aTF -> a_aTF)
                   (\ (ds_d1DA [Occ=Once] :: a_aTF)
                      (ds1_d1DB [Occ=Once!, OS=OneShot] :: a_aTF -> a_aTF)
                      _ [Occ=Dead, OS=OneShot] ->
                      ds1_d1DB ds_d1DA)
                   (id @ a_aTF)
                   xs_ast
                   (errorEmptyList
                      @ a_aTF
                      (build
                         @ Char (\ (@ b_a1Fd) -> unpackFoldrCString# @ b_a1Fd "last"#)))}]
last =
  \ (@ a_aTF) (xs_ast :: [a_aTF]) ->
    poly_go_r2F3 @ a_aTF xs_ast (lvl8_r2F2 @ a_aTF)

That looks better. The unfolding is a bit more complicated, as foldl is being inlined, but that is presumably ok.
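Read back into source form, the unfolding template in that dump is roughly the following. This is a hand transcription for illustration, not compiler output: the name `lastUnfolded` is made up, and plain `error` stands in for `errorEmptyList`.

```haskell
-- Hand-written approximation of the Tmpl in the Core dump:
-- foldl appears in its foldr-based form.
lastUnfolded :: [a] -> a
lastUnfolded xs =
  foldr (\x k _ -> k x)   -- ignore the old accumulator, pass the current element on
        id                -- at the end of the list, return the accumulator
        xs
        (error "last: empty list")  -- initial accumulator, returned only for []
```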

I’ll create a pull request with this in a small while, for review and discussion.

INLINEABLE does not work; probably because the unfolding looks too large. But INLINE seems to work. So with this patch

-last = foldl (\_ x -> x) (errorEmptyList "last")
+last xs = foldl (\_ x -> x) lastError xs
+{-# INLINE last #-}
+lastError = errorEmptyList "last"

we have that

without -O, GHC.List.last is used, with the desired memory behaviour, and
with -O, last gets inlined and fusion can happen.

(lastError is pulled out so that it is not copied wherever last is inlined.)
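The shape of the patch, written out as a standalone sketch that can be loaded outside of base. `myLast` and `myLastError` are hypothetical names, and plain `error` stands in for the internal `errorEmptyList`; the actual change is the diff above.

```haskell
-- Sketch of the patched shape; myLast and myLastError are hypothetical names.

-- Eta-expanded and marked INLINE, following the patch.
myLast :: [a] -> a
myLast xs = foldl (\_ x -> x) myLastError xs
{-# INLINE myLast #-}

-- A separate top-level binding, so the error call is shared rather than
-- duplicated at every site where myLast is inlined.
myLastError :: a
myLastError = error "myLast: empty list"
```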

Second try: D846

Trac metadata

Trac field	Value
Differential revisions	D844 → D846

I mean D847

Trac metadata

Trac field	Value
Differential revisions	D846 → D847

Replying to [ticket:10260#comment:98682 nomeata]:

@rwbarton, just to make sure I don’t get you wrong: Are you talking about compiling a "normal" program using last, using a normally built compiler, or are you talking about compiling the base libraries without optimization, e.g. during GHC development?

I mean using a compiler built with "perf", e.g. the 7.10.1 release, and compiling the user program without optimization.

Re ticket:10260#comment:98687 I too was surprised that last was not eta-expanded. But see this note in SimplUtils:

Note [Do not eta-expand PAPs]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We used to have old_arity = manifestArity rhs, which meant that we
would eta-expand even PAPs.  But this gives no particular advantage,
and can lead to a massive blow-up in code size, exhibited by Trac #9020.
Suppose we have a PAP
    foo :: IO ()
    foo = returnIO ()
Then we can eta-expand to
    foo = (\eta. (returnIO () |> sym g) eta) |> g
where
    g :: IO () ~ State# RealWorld -> (# State# RealWorld, () #)

But there is really no point in doing this, and it generates masses of
coercions and whatnot that eventually disappear again. For T9020, GHC
allocated 6.6G before, and 0.8G afterwards; and residency dropped from
1.8G to 45M.

But note that this won't eta-expand, say
  f = \g -> map g
Does it matter not eta-expanding such functions?  I'm not sure.  Perhaps
strictness analysis will have less to bite on?

The worse code for the un-eta-expanded last is perhaps an example of the last paragraph!
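For reference, the function from the last paragraph of the Note and a hand eta-expanded version of it can be written as below. The names `mapPap` and `mapEta` are made up for illustration; the sketch only shows the two source shapes and makes no claim about the Core GHC produces for either.

```haskell
-- Hypothetical names; only the shape of the definitions matters.

-- As in the Note: one lambda, whose body is a partial application of map.
mapPap :: (a -> b) -> [a] -> [b]
mapPap = \g -> map g

-- Eta-expanded by hand: both arguments are bound by the definition.
mapEta :: (a -> b) -> [a] -> [b]
mapEta = \g xs -> map g xs
```

Applied to the same arguments the two return the same list; the question in the Note is whether the optimizer should turn the first shape into the second on its own.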

And yet the comment is reasonably convincing. I'm not sure what to do here.

Joachim, one thing that might be worth trying is to eta-expand as far as possible without bumping into a coercion. That would expand last but not foo. Want to try that? We may fix last but there are sure to be other cases, and the more robust the optimisations the better.

And yet the comment is reasonably convincing. I'm not sure what to do here.

I think we are doing the right thing here. As far as I know, we also take the number of arguments given by the user as an indication of when things should inline or not; if we eta-expand PAPs, we take this possibility of control away from the user.

In this particular case: What if I had written last this way intentionally, so that the rule for foldl would not fire here, but only after last was inlined at a position where it is passed an argument?

(Or am I talking nonsense if we are talking about PAPs, in contrast to functions that might be eta-expanded more?)
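The point about argument counts is documented behaviour for functions that carry an INLINE pragma: the GHC User's Guide says such a function is inlined only when applied to as many arguments as appear syntactically on the left-hand side of its definition. A minimal sketch, modelled on the composition example in the guide:

```haskell
-- Two definitions of function composition with the same type and meaning.
-- They differ only in how many arguments appear left of the "=".

-- Two arguments on the LHS: eligible for inlining once f and g are supplied.
comp1 :: (b -> c) -> (a -> b) -> a -> c
comp1 f g = \x -> f (g x)
{-# INLINE comp1 #-}

-- Three arguments on the LHS: eligible only once x is supplied as well.
comp2 :: (b -> c) -> (a -> b) -> a -> c
comp2 f g x = f (g x)
{-# INLINE comp2 #-}
```

So a call such as `map (comp1 not not) xs` counts as a full application of `comp1`, while the same call with `comp2` does not.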

Well, all eta-expansion risks what you say. If the user wants that kind of control s/he can use INLINE/INLINEABLE. The eta expansion doesn't happen for the template that is inlined; only in the RHS that is compiled. Which is good. So I do think we should eta-expand here. The only reason not to is described in the comment. I think.

Ah, right, the template is stored before this happens; good.

What is our motivation to eta-expand PAPs? In this case, it is to allow for RULES to fire. Is it not likely that in order for RULES to fire, we might have to eta-expand beyond a coercion? So it does not gain a lot in robustness.

I’m inclined to wait for more than just this example to give this a shot, after all, implementing the heuristics (“if it is a PAP, and there is a chance to eta-expand, but only up to the next coercion...”) does not really make the code simpler...

Maybe. But eta expansion is in general a tremendously good thing. I'd love just to try it.
