Line that should allocate nothing adds ~30% to allocations
In optimizing the rewriter, I came across an interesting scenario.
We have this code:
rewrite_args_tc tc = rewrite_args all_bndrs any_named_bndrs inner_ki emptyVarSet
    -- NB: TyCon kinds are always closed
  where
    (bndrs, named)
      = ty_con_binders_ty_binders' (tyConBinders tc)
    -- !_named = any isNamedTyConBinder (tyConBinders tc)
    -- it's possible that the result kind has arrows (for, e.g., a type family)
    -- so we must split it
    (inner_bndrs, inner_ki, inner_named) = split_pi_tys' (tyConResKind tc)
    !all_bndrs = bndrs `chkAppend` inner_bndrs
    !any_named_bndrs = named || inner_named
    -- NB: Those bangs there drop allocations in T9872{a,c,d} by 8%.
Note the commented-out line in the middle. It is an alternate way of computing named, and I was wondering whether the alternate way would be faster. On test case T9872a, without the commented-out line, GHC allocates 2,186,222,936 bytes. If I allow the line in, GHC allocates 2,879,994,624 bytes, an impressive 31.7% increase. From that one line. That's really it. Note that the one line is a call to any, which should, by no means, do any allocation. tyConBinders is a field accessor for a list, so this should compile into a tight, non-allocating loop. isNamedTyConBinder is similarly innocent-looking:
isNamedTyConBinder :: TyConBinder -> Bool
isNamedTyConBinder (Bndr _ (NamedTCB {})) = True
isNamedTyConBinder _ = False
Furthermore, the existing call to ty_con_binders_ty_binders' already forces tyConBinders, so the allocations are somehow coming from the call to any itself.
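For anyone reading along, here is a minimal sketch of why the commented-out line does real work even though its result is never used: the bang on an unused where-binding still forces the right-hand side, so any walks the whole list. The Flag and Binder types and the lazyVersion/strictVersion names are hypothetical stand-ins, not GHC's real TyConBinder definitions. This only illustrates the evaluation order; it says nothing about where the extra allocation comes from.

```haskell
{-# LANGUAGE BangPatterns #-}
module Main where

import Control.Exception (SomeException, evaluate, try)

-- Hypothetical stand-ins for GHC's TyConBinder; NOT the real definitions.
data Flag = NamedTCB | AnonTCB
data Binder = Bndr String Flag

isNamedBinder :: Binder -> Bool
isNamedBinder (Bndr _ NamedTCB) = True
isNamedBinder _                 = False

-- Lazy binding: _named is never demanded, so `any` never runs.
lazyVersion :: [Binder] -> Int
lazyVersion bs = length bs
  where
    _named = any isNamedBinder bs

-- Banged binding: _named is forced before the body is evaluated,
-- so `any` inspects list elements even though the result is unused.
strictVersion :: [Binder] -> Int
strictVersion bs = length bs
  where
    !_named = any isNamedBinder bs

main :: IO ()
main = do
  -- The second element throws if anything pattern-matches on it.
  -- `length` only walks the spine, so it never touches the element.
  let poisoned = [Bndr "a" AnonTCB, error "element was forced"]
  r1 <- try (evaluate (lazyVersion poisoned))   :: IO (Either SomeException Int)
  r2 <- try (evaluate (strictVersion poisoned)) :: IO (Either SomeException Int)
  putStrLn ("lazy:   " ++ either (const "any ran over the list") show r1)
  putStrLn ("strict: " ++ either (const "any ran over the list") show r2)
```

The lazy version should report the length, while the strict version should hit the poisoned element, confirming that the bang makes the any loop run on every call.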
I've compiled with -ddump-simpl; the output is attached. Note go9_rqa6, go10_rqa7, and a few more friends. These functions, all identical, are the any loop. They don't appear to allocate. They are invoked from lvl29_shqZ, which is indeed forced. But I still fail to see who is allocating! Unless maybe it's lvl29_shqZ itself that is getting let-allocated, and that's taking up all the space?
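One way to test the "any should not allocate" assumption in isolation is to read the per-thread allocation counter around a forced call. The sketch below assumes getAllocationCounter from System.Mem in base; allocatedBy is a hypothetical helper, and the list of Ints is only a stand-in for the binder list. The reported number depends heavily on the GHC version and optimisation flags, so treat it as a rough probe rather than a reproduction of the T9872a result.

```haskell
module Main where

import Control.Exception (evaluate)
import Data.Int (Int64)
import System.Mem (getAllocationCounter)

-- Hypothetical helper: bytes allocated by the current thread while
-- running an action. The counter counts DOWN as the thread allocates,
-- so the difference is (before - after).
allocatedBy :: IO a -> IO (a, Int64)
allocatedBy act = do
  before <- getAllocationCounter
  r      <- act
  after  <- getAllocationCounter
  pure (r, before - after)

main :: IO ()
main = do
  let xs = [1 .. 100000] :: [Int]
  -- Force the whole list up front so that building it is not
  -- charged to the measurement below.
  _ <- evaluate (sum xs)
  -- The predicate never succeeds, so `any` has to walk every element.
  (found, bytes) <- allocatedBy (evaluate (any (< 0) xs))
  putStrLn ("any returned " ++ show found)
  putStrLn ("bytes allocated during the call: " ++ show bytes)
```

If the loop really is non-allocating, the byte count should stay small and should not grow with the length of the list; a count that scales with the list would point at the loop itself rather than at a let-allocated closure around it.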
This problem is not holding me up. But it may reveal something naughty GHC is doing that is causing trouble elsewhere. I'm labeling this as "runtime perf" because the problem is that GHC is producing bad code; the fact that the bad code happens to be GHC's own is essentially irrelevant.