While trying to optimize the linear branch, I decided to look at a ticky output of compiling a file. (The precise details are not relevant.) I took these results and sorted by the number of entries. The idea was to get a sense of what functions are actually important to pay attention to. The results are shocking. Here is the head:
We spend a lot of energy inserting into and looking up in maps. Perhaps that's expected. But these are lazy maps. Is that good? Is it possible that (gasp) a mutable hashtable would be better?
We spend a lot of energy dealing with lists. Lists! Lists are generally poor for anything other than iteration, where they should be fused away and never heard from again. Yet this profile contains entries for map, length(!), ++(!), reverse, and elem(!!!!!!). What on earth are we using elem (linear lookup in a list) for? And can we please find some better idea than (++), which copies its entire first argument?
We spend a lot of calls on forcing things: seqList, seqTypes, seqType. Maybe we just need stricter data structures.
We see (==). Any use of == on a concrete type should be blasted away by now. So this must be == on polymorphic types. Yet it persists. Maybe clever uses of SPECIALISE pragmas will kill this.
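To make the SPECIALISE idea concrete, here is a minimal sketch (memberBy is a hypothetical example, not GHC code): the pragma makes GHC emit a monomorphic copy in which the Eq dictionary is resolved, so (==) becomes a known call that can inline.

```haskell
-- Hypothetical example: a polymorphic function whose (==) is an
-- unknown dictionary call at every polymorphic use site.
memberBy :: Eq a => a -> [a] -> Bool
memberBy x = any (== x)

-- Ask GHC for a monomorphic copy for the hot Int case; there (==)
-- resolves to eqInt and can be inlined.
{-# SPECIALISE memberBy :: Int -> [Int] -> Bool #-}
```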
It is possible that tracking these down will find easy ways to make a big difference to compilation times. Note that there is nothing particular about linear types here (though implementing a mutable hashtable using linear types is tempting).
Although performance tuning Haskell code is hard, this might be a nice ticket for an experienced Haskeller (but not necessarily an experienced GHC hacker) to tackle. Happy to offer advice!
It would be interesting to know total allocations for this report to put the numbers in perspective.
Is IntMap the best map for our use case?
Based on this benchmark we should at least investigate switching to a different map type, at least as an option in places where IO-backed maps are reasonable.
The downside is of course that it's possible that IO/ST will creep into a lot of code, negating parts of the performance benefit as well as making the code harder to work with.
Pure HashMaps on the other hand don't seem to be worth it.
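As a rough sketch of what the IO/ST route looks like (assuming the hashtables package; lookupAll is a made-up example, not GHC code):

```haskell
import           Control.Monad.ST (runST)
import qualified Data.HashTable.ST.Basic as H

-- Build a mutable table and query it, all inside runST: the
-- mutation stays local, so the function's API remains pure.
lookupAll :: [(Int, String)] -> [Int] -> [Maybe String]
lookupAll kvs ks = runST $ do
  tbl <- H.new
  mapM_ (uncurry (H.insert tbl)) kvs
  mapM (H.lookup tbl) ks
```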
(==)
There is #17759, which means almost any expression of the sort `xs == ys`, where xs has a concrete type `[Foo]`, still can't be specialized.
I suspect this is the reason why == shows up so often.
Lists
List fusion can be unreliable; it can fail in practice for various reasons, the most recent example being elem: #18034 (closed). GHC also often uses lists explicitly as data structures, in places where the compiler can't eliminate them via list fusion. So `map` showing up is not at all surprising.
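For example, one common way fusion fails in practice is sharing (good/bad are illustrative names):

```haskell
-- Fuses: map is consumed directly by foldr, so the intermediate
-- list is never built.
good :: [Int] -> Int
good = foldr (+) 0 . map (* 2)

-- Does not fuse: ys is shared by two consumers, so the intermediate
-- list must be materialised.
bad :: [Int] -> (Int, Int)
bad xs = let ys = map (* 2) xs in (foldr (+) 0 ys, length ys)
```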
What on earth are we using elem (linear lookup in a list) for?
Things like tc `elem` [ eqPrimTyCon, eqReprPrimTyCon, heqTyCon ] are perfectly sane. The register allocator also uses elem and even nub(!) in places where it's guaranteed that the given lists will be very small. Sometimes it's just more overhead to build a set than to just use elem on the list directly.
For (++) we should usually use OrdList to avoid the problem; it's already used heavily in the backend exactly to avoid (++).
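For reference, a minimal sketch of the idea behind OrdList (GHC's actual OrdList has more constructors and operations):

```haskell
-- An append tree: (<>) is O(1) and never copies; the list is only
-- materialised once, by a single flattening pass at the end.
data OrdList a
  = None
  | One a
  | Two (OrdList a) (OrdList a)

instance Semigroup (OrdList a) where
  None <> ys = ys
  xs <> None = xs
  xs <> ys   = Two xs ys

instance Monoid (OrdList a) where
  mempty = None

fromOL :: OrdList a -> [a]
fromOL ol = go ol []
  where
    go None      acc = acc
    go (One x)   acc = x : acc
    go (Two l r) acc = go l (go r acc)
```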
Based on the allocation counts it seems likely there are some bad uses here. But we might just have a lot of sane uses which in their sum amount to a lot of work. That is to say if we replaced all of them we might not see any benefit in terms of runtime.
Using more array-based structures to pass around things like bindings would still be good. But it's a lot of work, and there are things we can do with lists which are currently not possible with arrays.
I tried going in that direction in the past, but incremental changes made things worse as they introduced a lot of conversions from/to lists. In the end I wasn't willing to spend the time to refactor enough code to say for sure if it would pay off.
Strictness
Parts of the compiler could benefit from using stricter types, including maps; in others we would loop or crash. So this is not easily explored. I tried making some maps and types strict in the past, and often it either made no difference or caused correctness issues.
We probably can do a bit of a better job there. But it heavily depends on the part of GHC you are looking at and none of the wins will be all that low hanging I fear.
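To illustrate what is at stake, a small standalone example (not GHC code); Data.Map.Strict and Data.Map.Lazy share the same Map type and differ only in whether values are forced as they are inserted:

```haskell
import           Data.List (foldl')
import qualified Data.Map.Lazy   as ML
import qualified Data.Map.Strict as MS

-- Both build the same Map type, but the lazy API leaves a chain of
-- (+) thunks behind each key, while the strict API forces each
-- combined value on insertion.
lazyCount, strictCount :: [String] -> ML.Map String Int
lazyCount   = foldl' (\m k -> ML.insertWith (+) k 1 m) ML.empty
strictCount = foldl' (\m k -> MS.insertWith (+) k 1 m) MS.empty
```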
From #20222 (closed) I narrowed down the loops caused by strictness in FM.hs to just 3 functions. Then I swapped all IntMaps and Maps to be strict in this branch.
The CI showed that allocations increased by up to 10% in some cases. I didn't dig any further with perf though.
It's often the case that when you make some things strict, the unnecessary long-lived thunks just get pushed elsewhere rather than disappearing. The new thunks and their associated memory leaks, if any, can easily be bigger, because more things are created during evaluation but not necessarily anything gets released.
What I'm saying is that the 10% may not come from pessimisation, but from there still not being enough strictness.
I've made a patch to avoid allocations in IntMap.lookup.
Patch
```diff
--- containers/src/Data/IntMap/Internal.hs
+++ containers/src/Data/IntMap/Internal.hs
@@ -584,31 +584,28 @@ notMember :: Key -> IntMap a -> Bool
 notMember k m = not $ member k m
 
 -- | /O(min(n,W))/. Lookup the value at a key in the map. See also 'Data.Map.lookup'.
 --
 -- See Note: Local 'go' functions and capturing]
 lookup :: Key -> IntMap a -> Maybe a
-lookup !k = go
-  where
-    go (Bin p m l r) | nomatch k p m = Nothing
-                     | zero k m = go l
-                     | otherwise = go r
-    go (Tip kx x) | k == kx = Just x
-                  | otherwise = Nothing
-    go Nil = Nothing
+lookup = go
+  where
+    go !k (Bin p m l r) | nomatch k p m = Nothing
+                        | zero k m = go k l
+                        | otherwise = go k r
+    go !k (Tip kx x) | k == kx = Just x
+                     | otherwise = Nothing
+    go _ Nil = Nothing
 
--- See Note: Local 'go' functions and capturing]
 find :: Key -> IntMap a -> a
-find !k = go
+find = go
   where
-    go (Bin p m l r) | nomatch k p m = not_found
-                     | zero k m = go l
-                     | otherwise = go r
-    go (Tip kx x) | k == kx = x
-                  | otherwise = not_found
-    go Nil = not_found
-
-    not_found = error ("IntMap.!: key " ++ show k ++ " is not an element of the map")
+    go !k (Bin p m l r) | nomatch k p m = not_found k
+                        | zero k m = go k l
+                        | otherwise = go k r
+    go !k (Tip kx x) | k == kx = x
+                     | otherwise = not_found k
+    go !k Nil = not_found k
+
+    not_found !k = error ("IntMap.!: key " ++ show k ++ " is not an element of the map")
```
It doesn't seem to have much impact on containers' intmap-benchmark. Could someone try to reproduce ticky results with GHC?
Things like tc `elem` [ eqPrimTyCon, eqReprPrimTyCon, heqTyCon ] are perfectly sane.
Because it's not obvious, I want to point out that this code used to generate a call to elem but no longer does, and that this has been a fairly recent change, as I remember it from before @rae did this analysis.
Based on commit acf537f9 ("Make splitAtList strict in its arguments")
By invoking:
'E:\ghc_head\_ticky\stage1\bin\ghc.exe' 'nofib/spectral/simple/Main.hs' '-O' '-fforce-recomp' +RTS '-s' '-r'
RTS stats:
```
$ _ticky/stage1/bin/ghc.exe nofib/spectral/simple/Main.hs -O -fforce-recomp +RTS -s -r
[1 of 1] Compiling Main             ( nofib\spectral\simple\Main.hs, nofib\spectral\simple\Main.o )
Linking nofib\spectral\simple\Main.exe ...
   4,384,197,400 bytes allocated in the heap
     925,177,096 bytes copied during GC
      75,695,032 bytes maximum residency (14 sample(s))
         515,144 bytes maximum slop
             205 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       431 colls,     0 par    2.531s   2.584s     0.0060s    0.0565s
  Gen  1        14 colls,     0 par    1.484s   1.535s     0.1096s    0.2819s

  TASKS: 5 (1 bound, 4 peak workers (4 total), using -N1)
  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time    3.203s  (  5.742s elapsed)
  GC      time    4.016s  (  4.119s elapsed)
  EXIT    time    0.000s  (  0.001s elapsed)
  Total   time    7.219s  (  9.862s elapsed)

  Alloc rate    1,368,725,041 bytes per MUT second

  Productivity  44.4% of total user, 58.2% of total elapsed
```
With no other context, I will just use the number of inserts as a baseline to compare the reports.
Things that stood out to me:
In the original ticket we have 1-2 calls to seqList per insert call. In the ticky profile I created it's one call to seqList per ~ 200 inserts. Other seq calls are also rarer.
There are about half as many calls to elem, which might be a result of the improved rules I mentioned. Whether the remaining calls are warranted or would be better served by map lookups is still unclear, however.
A lot of other changes, but none that stand out all that much.
Each application of the mapping function should result in an unknown call. So 7M unknown calls almost seems a bit low with >2M calls to map. But I suppose it's possible.
Maybe it's worthwhile to look into allowing map to inline, so that the mapping function becomes a known call. (E.g. use a local recursive go function)
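A minimal sketch of that shape (mapInline is a hypothetical stand-in, not the library map):

```haskell
-- The recursion lives in a local go, so the wrapper itself is
-- non-recursive and can inline at call sites, turning the unknown
-- call to the mapping function into a known (often inlinable) call.
mapInline :: (a -> b) -> [a] -> [b]
mapInline f = go
  where
    go []     = []
    go (x:xs) = f x : go xs
{-# INLINE mapInline #-}
```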
In the original ticket we have 1-2 calls to seqList per insert call. In the ticky profile I created it's one call to seqList per ~ 200 inserts. Other seq calls are also rarer.
That sounds like an improvement. But I've lost track of what you are comparing. GHC baseline vs (GHC baseline + splitAt patch)? Or what? How are we getting so many fewer calls to seqList?
Before inlining map (and thereby spreading its allocations around) I wonder if we could find out where all those calls to map are coming from, and how long the lists are. But I agree about your point about unknown calls.
That sounds like an improvement. But I've lost track of what you are comparing. GHC baseline vs (GHC baseline + splitAt patch)? Or what? How are we getting so many fewer calls to seqList?
Hard to say! I'm comparing my profile with the one from Richard, using the number of inserts into IntMap as a measure of work done by the compiler.
This is a pretty bad measure. But without knowing what commit his profile was based on, what file GHC was compiling, and which options were used to compile it, I feel it's the best I could do.
All it really shows is that in some unknown case we call seqList a lot. In another case (which I found historically to be pretty representative) we don't. So before we spend time to get rid of these, we should first establish that they are a problem in the common case.
Before inlining map (and thereby spreading its allocations around) I wonder if we could find out where all those calls to map are coming from, and how long the lists are. But I agree about your point about unknown calls.
For what it's worth, I tried rewriting map with local recursion, and while we got fewer unknown calls it wasn't clear whether it was an overall improvement. So likely no low-hanging fruit there. We should definitely try to find out where these calls (and reverse, ++) come from. Maybe there are a few hotspots which can be refactored to make them better.
I think we should use lists much less. They are always the wrong data structure, and only really useful when fused as a control structure. In the latter case, one could argue we should have newtype FB a = FB { foldr_ :: forall r. (a -> r -> r) -> r -> r } rather than all these rewrite rules, which effectively rewrite to this type and back. To make matters worse, the compiler often can't decide whether to share such a list (in which case fusion fails) or not.
If we only had FB, then we would just need a conversion function fuse :: Seq a -> FB a to enable explicit fusion. And probably a better, more cache-friendly version of Seq, based on RRB trees. Bonus: no need for Bag, OrdList, etc.; RRB trees should be vastly superior in all use cases. Also no messing about with seqList.
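A minimal sketch of the FB idea (with [] standing in for the hypothetical Seq):

```haskell
{-# LANGUAGE RankNTypes #-}

-- A Church-encoded list: consumers are handed the fold directly,
-- so no intermediate cons cells are ever built.
newtype FB a = FB { foldr_ :: forall r. (a -> r -> r) -> r -> r }

-- Explicit fusion entry point (the source would be Seq in the
-- proposal; [] stands in here).
fuse :: [a] -> FB a
fuse xs = FB (\c n -> foldr c n xs)

mapFB :: (a -> b) -> FB a -> FB b
mapFB f (FB g) = FB (\c n -> g (c . f) n)

filterFB :: (a -> Bool) -> FB a -> FB a
filterFB p (FB g) = FB (\c n -> g (\x r -> if p x then c x r else r) n)

unfuse :: FB a -> [a]
unfuse (FB g) = g (:) []
```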
I see that refactoring all uses of lists within GHC is daunting. But we could do this incrementally!
If there are hints of missed specialization, it may be worth it to compile with -fexpose-all-unfoldings -fspecialise-aggressively -fsimpl-tick-factor=200 -Wall-missed-specialisations and compare the results. These options are likely to improve performance of any code written polymorphically with constraints in place of monad transformer stacks, which is not the case here, but it doesn't hurt to check just in case.
For what it's worth, the "bottom-up" profiling mechanism implemented in !3871 (closed) is quite handy for tracking down things like mysterious (++) and reverse uses. Concretely, after building that branch with the relevant cost-centre instrumentation enabled, one gets a profile containing cost-centers for all call-sites of map and reverse. One can then easily extract a sorted list of cost-centers with bgamari/bottom-up-analysis>.
In the case of reverse (excerpted above) we see that the overwhelming majority of uses come from various occurrences of the "accumulator" pattern. For instance, in collectArgsTicks we have:
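```haskell
-- (slightly abbreviated) The classic accumulator pattern: ticks are
-- gathered in reverse and put back in order with a final reverse;
-- arguments come out in the right order as the App spine unwinds.
collectArgsTicks :: (Tickish Id -> Bool) -> Expr b
                 -> (Expr b, [Arg b], [Tickish Id])
collectArgsTicks skipTick expr = go expr [] []
  where
    go (App f a)  as ts = go f (a:as) ts
    go (Tick t e) as ts
      | skipTick t      = go e as (t:ts)
    go e          as ts = (e, as, reverse ts)
```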
It's quite unclear to me whether this can be improved. DList is of course an option, but I suspect it will end up very similar to the accumulator pattern in its performance characteristics. A structure like Data.Sequence is of course another option, but it comes with far larger constant factors, which I suspect won't work out well with the small lists we are typically working with.
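For reference, DList in miniature, which shows why it ends up so close to the accumulator pattern: building is O(1) per element and the single full traversal happens at conversion time.

```haskell
-- Appending is function composition; the list is materialised once,
-- when the composed function is finally applied to [].
newtype DList a = DList ([a] -> [a])

emptyD :: DList a
emptyD = DList id

snocD :: DList a -> a -> DList a
snocD (DList f) x = DList (f . (x:))

toListD :: DList a -> [a]
toListD (DList f) = f []
```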
It might be interesting to try the "keep the list on the stack" version. It would look like this:
```haskell
go (App f a) = let !(e', as, ts) = go f in (e', a:as, ts)
go (Tick t e) | skipTick t = let !(e', as, ts) = go e in (e', as, t:ts)
go e = (e, [], [])
```
There's a lot of building a triple and taking it apart again, but I think that CPR analysis would nail that. So it might be faster. Perhaps worth profiling a micro-benchmark.
This is a good point; in this case the model of collectArgsTicks in my benchmark is actually incorrect (or rather, the names in the source are slightly misleading).
One tricky consideration here is that we usually don't have any ticks in a typical program. Consequently, the reverse call will be very cheap indeed. This raises the question of whether this is the right place to be focusing optimisation effort.
I quickly hacked together a microbenchmark (bgamari/append-benchmark>) examining the performance of Sequence, DList, and reverse in the "list rebuild" basic usage found in the accumulator pattern above. The results are somewhat interesting:
- for long lists (above 20 elements or so), Seq is slower than both DList and reverse. This is the opposite of what my intuition predicted.
- for short lists (5 or fewer elements), Seq is slightly faster
I also studied a model of the collectArgsTicks case. I'm working on summarizing the results of this.
I would not expect Data.Sequence to make a particularly good list-building monoid; it does better when you actually use it as a sequence. A catenable list with fat nodes might work: something like Data.Sequence, but filled with larger chunks of elements. You don't need splitting and such, I assume, so having chunks of varying size is okay.
Edit: Also note that they admit efficient transient (as opposed to persistent) semantics that we could safely expose through an ST-based or even linear type-based approach. Ed Kmett wrote an unfinished implementation a while ago.