Skip to content

The size of FastString table is suboptimal for large codebases

Context: I was profiling a no-op rebuild (build once, do incremental build with nothing changed) of a large code base. It takes 35.829s, and according to the profile about 20% of the time is spent in mkFastStringByteString.

I run it with -dfaststring-stats +RTS -s and got:

FastString stats:
    size:            4091
    entries:         829959
    longest chain:   268
    has z-encoding:  0%
  50,011,906,640 bytes allocated in the heap
   7,936,277,664 bytes copied during GC
   3,106,526,336 bytes maximum residency (8 sample(s))
     123,108,224 bytes maximum slop
           10169 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       117 colls,    37 par   32.390s   2.293s     0.0196s    0.0911s
  Gen  1         8 colls,     3 par   19.722s   2.266s     0.2832s    0.6936s

  Parallel GC work balance: 60.62% (serial 0%, perfect 100%)

  TASKS: 93 (1 bound, 92 peak workers (92 total), using -N30)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.009s  (  0.009s elapsed)
  MUT     time  277.370s  ( 31.261s elapsed)
  GC      time   52.113s  (  4.558s elapsed)
  EXIT    time    0.005s  (  0.001s elapsed)
  Total   time  329.497s  ( 35.829s elapsed)

  Alloc rate    180,307,295 bytes per MUT second

  Productivity  84.2% of total user, 87.3% of total elapsed	

That's pretty bad, fitting 800k elements in 4k buckets is hard.

To confirm this indeed could be improved I changed the size to 2^20 (1048576) and also removed division in hashStr function (it come up on perf profile), like this:

hashStr  :: Ptr Word8 -> Int -> Int
 -- use the Addr to produce a hash value between 0 & m (inclusive)
hashStr (Ptr a#) (I# len#) = loop 0# 0#
   where
    loop h n | isTrue# (n ==# len#) = I# h
             | otherwise  = loop h2 (n +# 1#)
          where !c = ord# (indexCharOffAddr# a# n)
                -- !h2 = (c +# (h *# 128#)) `remInt#`
                --       hASH_TBL_SIZE#
                -- NOTE: below is multiplication by 127 (a prime) and division by 2^20, it's by no means the best hashing function
                !h2 = (c +# ((h `uncheckedIShiftL#` 7#) -# h)) `andI#` hASH_TBL_SIZE_MASK#

Here's what I ended up with:

FastString stats:
    size:            1048576
    entries:         829959
    longest chain:   8
    has z-encoding:  0%
  51,012,218,048 bytes allocated in the heap
   8,948,377,768 bytes copied during GC
   3,543,797,248 bytes maximum residency (8 sample(s))
     139,428,352 bytes maximum slop
           10502 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       115 colls,    39 par   32.261s   2.677s     0.0233s    0.2225s
  Gen  1         8 colls,     3 par   21.786s   2.329s     0.2911s    0.5737s

  Parallel GC work balance: 48.75% (serial 0%, perfect 100%)

  TASKS: 93 (1 bound, 92 peak workers (92 total), using -N30)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.008s  (  0.008s elapsed)
  MUT     time   65.508s  ( 23.791s elapsed)
  GC      time   54.047s  (  5.006s elapsed)
  EXIT    time    0.007s  (  0.001s elapsed)
  Total   time  119.570s  ( 28.806s elapsed)

  Alloc rate    778,712,134 bytes per MUT second

  Productivity  54.8% of total user, 82.6% of total elapsed

The realtime improvement is nice (-20%), but even nicer is the "serial" time improvement, from 329.497sto 119.570s (2.75x speedup).

This is on modified 8.0.2, but the related code looks about the same on HEAD.

Trac metadata
Trac field Value
Version
Type Bug
TypeOfFailure OtherFailure
Priority normal
Resolution Unresolved
Component Compiler
Test case
Differential revisions
BlockedBy
Related
Blocking
CC simonmar
Operating system
Architecture
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information