GHC exploits superoptimal insertion patterns of IntMaps
Abstract (Simon's digest)
- GHC tends to allocate fresh Uniques in ascending order.
- Hence (as it turns out) it often inserts them in ascending order.
- This pattern makes `IntMap` behave better than its average case.
- The effect is significant: scrambling the keys to force the average-case behaviour makes GHC go 2-5% slower.
This is an attempt to summarise and track issues related to a kind of "anti-pathological" IntMap performance pattern.
Here is a list of issues that we think are the result of this effect:
- #19414 (closed) wonders about some test cases varying wildly depending on `-dunique-increment` (see also !11499).
- A previous attempt at characterising this issue was #20222 (closed), but the hypothesis was wrong.
- "Modifying the in-scope set is costly" #19820 (again, unbalanced IntMaps is probably the wrong hypothesis).
Summary
As discovered in #20222 (comment 532656) and reproduced in !11499, inserts into an `IntMap` are much less costly when inserting at a key that lies above or below all previously inserted keys. Consequently, the order in which we insert keys into an `IntMap` can have a pronounced effect on allocations/performance, in the range of 1-2x allocations (depending on a log factor of the input size, presumably, so it could in theory go beyond 2x). It turns out that GHC often inserts keys in ascending order, because the Uniques it draws are increasing, and thus benefits from superoptimality.
Very briefly and pointedly, assuming a big-endian PATRICIA trie implementation as we have:

- `IntMap.fromList [0,1,2,3]` is fast (allocates 1+1+2=4 `Bin` constructors in total)
- `IntMap.fromList [3,2,1,0]` is fast (allocates 1+1+2=4 `Bin` constructors in total)
- `IntMap.fromList [0,2,3,1]` is slow (allocates 1+2+2=5 `Bin` constructors in total)
- `IntMap.fromList [3,1,2,0]` is slow (allocates 1+2+2=5 `Bin` constructors in total)

(Note that `IntMap.fromList = foldr (uncurry IntMap.insert) IntMap.empty`, so keys are inserted starting from the right end of the list.)
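To see where the extra `Bin` comes from, here is a hand trace of the inserts (my own working, following big-endian PATRICIA insertion as sketched further below):

```haskell
-- fromList [0,1,2,3]: foldr inserts 3, then 2, then 1, then 0 (descending).
--   insert 3 into empty  --> Tip 3                                 (0 Bins)
--   insert 2             --> Bin 2 1 (Tip 2) (Tip 3)               (1 Bin)
--   insert 1             --> prefix mismatch at the root: one new
--                            Bin is linked ABOVE the shared tree   (1 Bin)
--   insert 0             --> descends one level: rebuild the root,
--                            link at Tip 1                         (2 Bins)
--                                                          total: 4 Bins
--
-- fromList [0,2,3,1]: foldr inserts 1, then 3, then 2, then 0.
--   insert 1 into empty  --> Tip 1                                 (0 Bins)
--   insert 3             --> Bin 0 2 (Tip 1) (Tip 3)               (1 Bin)
--   insert 2             --> descends: rebuild the root,
--                            link at Tip 3                         (2 Bins)
--   insert 0             --> descends: rebuild the root,
--                            link at Tip 1                         (2 Bins)
--                                                          total: 5 Bins
```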
This doesn't look like much, but when we insert the key `2^b` into `fromList [0..2^b-1]`, we need to allocate only a single `Bin` constructor, despite the height of the tree being `b`. In this insertion pattern, we always skip over full left sub-trees. Essentially, whenever a carry into bit `i` of the key happens, we need to allocate O(i) fewer `Bin` constructors than the worst-case asymptotic optimum; thus we leverage an "anti-pathological", or superoptimal, input pattern of the `IntMap` data structure, often allocating far fewer than O(log n) nodes per insertion.
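For reference, here is a minimal, self-contained sketch of big-endian PATRICIA insertion in the style of `Data.IntMap.Internal` (my own simplification: a set rather than a map, non-negative keys only, and none of the library's word-level tricks). The `nomatch` case is the superoptimal fast path:

```haskell
import Data.Bits (xor, (.&.), complement, bit, countLeadingZeros, finiteBitSize)

type Prefix = Int
type Mask   = Int

-- A set suffices to count Bin allocations; a map is the same modulo values.
data PatSet = Bin !Prefix !Mask !PatSet !PatSet | Tip !Int | Nil

-- The highest bit in which k1 and k2 differ (assumes k1 /= k2, both >= 0).
branchMask :: Int -> Int -> Mask
branchMask k1 k2 = bit (finiteBitSize d - 1 - countLeadingZeros d)
  where d = k1 `xor` k2

-- Keep only the bits of k strictly above the mask bit m.
mask :: Int -> Mask -> Prefix
mask k m = k .&. (complement (m - 1) `xor` m)

zero :: Int -> Mask -> Bool
zero k m = k .&. m == 0

nomatch :: Int -> Prefix -> Mask -> Bool
nomatch k p m = mask k m /= p

-- Join a fresh leaf and an existing (sub)tree with ONE new Bin.
link :: Int -> PatSet -> Prefix -> PatSet -> PatSet
link k new p old
  | zero k m  = Bin (mask k m) m new old
  | otherwise = Bin (mask k m) m old new
  where m = branchMask k p

insert :: Int -> PatSet -> PatSet
insert k t = case t of
  Nil -> Tip k
  Tip j
    | j == k    -> t
    | otherwise -> link k (Tip k) j t
  Bin p m l r
    -- The key disagrees with this subtree's prefix: allocate a single Bin
    -- above the whole (shared!) subtree and stop. A new maximum (or
    -- minimum) key takes this exit near the root, e.g. inserting 2^b
    -- into fromList [0..2^b-1] allocates exactly one Bin.
    | nomatch k p m -> link k (Tip k) p t
    -- Otherwise descend, rebuilding one Bin per level on the way back up.
    | zero k m      -> Bin p m (insert k l) r
    | otherwise     -> Bin p m l (insert k r)
```

The crucial property is that `link` places one new `Bin` above the entire old subtree, which is shared as-is, whereas the descending cases rebuild one `Bin` per level.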
It turns out that almost all of our performance tests exhibit this insertion pattern, as the benefit goes away when picking a very large `-dunique-increment` that destroys the carry-over effect: !11499 reports a ghc/alloc increase of 2.6% in the geometric mean (and as much as 13.2% in test case `T3294`).
Reproduction
Here's a simple benchmark, originally from #20222 (comment 532656), that reproduces the effect. Comment in the different definitions for `xs` and run the artifact with `+RTS -s`:
```haskell
import qualified Data.IntMap.Strict as M
import Data.Bits

-- | https://gist.github.com/degski/6e2069d6035ae04d5d6f64981c995ec2
mix :: Int -> Int -> Int
{-# INLINE mix #-}
mix k = fromIntegral . f . g . f . g . f . fromIntegral
  where
    f y = (y `shiftR` s) `xor` y
    g z = z * k
    s = finiteBitSize k `shiftR` 1

kFORWARD, kBACKWARD :: Int
-- These are like "encryption" and "decryption" keys to mix
kFORWARD  = 0xD6E8FEB86659FD93
kBACKWARD = 0xCFEE444D8B59A89B

enc, dec :: Int -> Int
enc = mix kFORWARD
dec = mix kBACKWARD
{-# NOINLINE enc #-}
{-# NOINLINE dec #-}

fromList = foldr (uncurry M.insert) M.empty
{-# NOINLINE fromList #-}

main = do
  let xs = map (\k -> (k,k)) $ [1..108880]
  -- let xs = map (\k -> (enc k,k)) $ [1..108880]
  -- let xs = map (\k -> (k,k)) $ take 108880 $ iterate (\n -> n*43 `mod` 108881) 1
  -- I verified that 43 is a primitive root of 108881, so `iterate` produces
  -- a pseudo-random order of [1..108880]
  -- let xs = map (\k -> (k,k)) $ [108880,108879..1]
  print $ M.foldr (+) 0 $ fromList xs
```
The baseline `[1..108880]` allocates 53MB when compiled with -O1 and run, while the form with `enc` takes 87MB. To verify that this is actually an effect of shuffling rather than just of the larger space of non-zero bits, I added the `iterate` form, which generates a shuffle of `[1..108880]`; it takes 86MB, much like the `enc`oded form. Finally, `[108880,108879..1]` without shuffling takes just 50MB (yes, even less).
Consequences
Most alternative data structures, such as `HashMap`s or ordered `Map`s, can't leverage the input patterns discussed above: hashing deliberately scatters consecutive keys, and an ordered map rebuilds its O(log n) path regardless of insertion order. As such, they have an inherent disadvantage compared to `IntMap`s, explaining measurements such as #20222 (comment 378458) (+50% allocations and runtime when compiling Cabal with HashMaps).
At the same time, superoptimality is sometimes a brittle property to preserve, as evidenced by T12545. That is what #19414 (closed) tracks: the brittleness of preserving superoptimality, not so much the superoptimality itself, of which T3294 is a big beneficiary. However, T3294 does not seem to be that brittle, as far as we can tell.
It is hard to say how to act on this data. Should we

- Run all our perf test cases with a big `-dunique-increment` to eliminate superoptimality from our measurements, thus fixing #19414 (closed)? But then our measurements would not be representative of real GHC performance! And we might indeed introduce changes that destroy superoptimality for good without noticing in CI. So perhaps this is not an option and we have to live with #19414 (closed).
- Switch to a less input-sensitive data structure such as ordered maps, or apply a mixing function to keys just to get deterministic performance (see the sketch after this list)? !11470 does the latter and suggests we would leave a bit of performance on the table (though not as much as the 50% for HashMaps).
- Follow https://pp.ipd.kit.edu/uploads/publikationen/ellinger22bachelorarbeit.pdf#chapter.3, which applies a transient version of WordMap to GHC's InScopeSet in order to allocate fewer internal nodes? That seems like a good way forward, especially now that we have vendored our own WordMap structure, meaning that conversion to an ephemeral version would have next to no overhead (instead of copying and rebuilding the whole tree, as that thesis needed to do for its experiment).
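To make the mixing-function option concrete, here is a hypothetical sketch (my own illustration, not the actual code of !11470) that reuses `mix`/`enc`/`dec` from the benchmark above and assumes `dec` inverts `enc`:

```haskell
import qualified Data.IntMap.Strict as M

-- An IntMap whose keys are stored in mixed form: whatever order the
-- caller inserts in, the stored keys look pseudo-random, so performance
-- is deterministically average-case. (enc/dec as in the benchmark above.)
newtype MixedMap a = MixedMap (M.IntMap a)

emptyMixed :: MixedMap a
emptyMixed = MixedMap M.empty

insertMixed :: Int -> a -> MixedMap a -> MixedMap a
insertMixed k v (MixedMap m) = MixedMap (M.insert (enc k) v m)

lookupMixed :: Int -> MixedMap a -> Maybe a
lookupMixed k (MixedMap m) = M.lookup (enc k) m

-- Enumeration must unmix with `dec` to recover the original keys; note
-- that key-ordered operations (ordered folds, splits) lose their meaning.
toListMixed :: MixedMap a -> [(Int, a)]
toListMixed (MixedMap m) = [ (dec k, v) | (k, v) <- M.toList m ]
```

The price is one `mix` per operation and the loss of key-ordered traversal, which is plausibly the bit of performance !11470 reports leaving on the table.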
For now, I just wanted to document this phenomenon in its own ticket so that we have a single, organised place to link to.