The size of FastString table is suboptimal for large codebases
Context:
I was profiling a no-op rebuild (build once, do incremental build with nothing changed) of a large code base. It takes 35.829s
, and according to the profile about 20%
of the time is spent in mkFastStringByteString
.
I run it with -dfaststring-stats +RTS -s
and got:
FastString stats:
size: 4091
entries: 829959
longest chain: 268
has z-encoding: 0%
50,011,906,640 bytes allocated in the heap
7,936,277,664 bytes copied during GC
3,106,526,336 bytes maximum residency (8 sample(s))
123,108,224 bytes maximum slop
10169 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 117 colls, 37 par 32.390s 2.293s 0.0196s 0.0911s
Gen 1 8 colls, 3 par 19.722s 2.266s 0.2832s 0.6936s
Parallel GC work balance: 60.62% (serial 0%, perfect 100%)
TASKS: 93 (1 bound, 92 peak workers (92 total), using -N30)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.009s ( 0.009s elapsed)
MUT time 277.370s ( 31.261s elapsed)
GC time 52.113s ( 4.558s elapsed)
EXIT time 0.005s ( 0.001s elapsed)
Total time 329.497s ( 35.829s elapsed)
Alloc rate 180,307,295 bytes per MUT second
Productivity 84.2% of total user, 87.3% of total elapsed
That's pretty bad, fitting 800k
elements in 4k
buckets is hard.
To confirm this indeed could be improved I changed the size to 2^20
(1048576
) and also removed division in hashStr
function (it come up on perf
profile), like this:
hashStr :: Ptr Word8 -> Int -> Int
-- use the Addr to produce a hash value between 0 & m (inclusive)
hashStr (Ptr a#) (I# len#) = loop 0# 0#
where
loop h n | isTrue# (n ==# len#) = I# h
| otherwise = loop h2 (n +# 1#)
where !c = ord# (indexCharOffAddr# a# n)
-- !h2 = (c +# (h *# 128#)) `remInt#`
-- hASH_TBL_SIZE#
-- NOTE: below is multiplication by 127 (a prime) and division by 2^20, it's by no means the best hashing function
!h2 = (c +# ((h `uncheckedIShiftL#` 7#) -# h)) `andI#` hASH_TBL_SIZE_MASK#
Here's what I ended up with:
FastString stats:
size: 1048576
entries: 829959
longest chain: 8
has z-encoding: 0%
51,012,218,048 bytes allocated in the heap
8,948,377,768 bytes copied during GC
3,543,797,248 bytes maximum residency (8 sample(s))
139,428,352 bytes maximum slop
10502 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 115 colls, 39 par 32.261s 2.677s 0.0233s 0.2225s
Gen 1 8 colls, 3 par 21.786s 2.329s 0.2911s 0.5737s
Parallel GC work balance: 48.75% (serial 0%, perfect 100%)
TASKS: 93 (1 bound, 92 peak workers (92 total), using -N30)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.008s ( 0.008s elapsed)
MUT time 65.508s ( 23.791s elapsed)
GC time 54.047s ( 5.006s elapsed)
EXIT time 0.007s ( 0.001s elapsed)
Total time 119.570s ( 28.806s elapsed)
Alloc rate 778,712,134 bytes per MUT second
Productivity 54.8% of total user, 82.6% of total elapsed
The realtime improvement is nice (-20%
), but even nicer is the "serial" time improvement, from 329.497s
to 119.570s
(2.75x
speedup).
This is on modified 8.0.2, but the related code looks about the same on HEAD.
Trac metadata
Trac field | Value |
---|---|
Version | |
Type | Bug |
TypeOfFailure | OtherFailure |
Priority | normal |
Resolution | Unresolved |
Component | Compiler |
Test case | |
Differential revisions | |
BlockedBy | |
Related | |
Blocking | |
CC | simonmar |
Operating system | |
Architecture |