Improve FastString memory behaviour
This PR changes the `FastString` implementation to use `ShortByteString` instead of `ByteString`, and removes the cache for `lengthFS`. It also pulls in the lazy z-encoding cache PR !1039 (closed).
There are multiple reasons we want this:
- Fewer allocations: `ByteString` has three fields while `ShortByteString` has just one, and removing `fs_nchars` removes another one.
- `ByteString` memory is pinned:
  - This can cause fragmentation issues (see for example #13110), and also
  - makes using FastStrings in compact regions impossible.
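For reference, the field-count difference looks roughly like this. These are simplified, illustrative types (the real definitions live in the `bytestring` package, and strictness/UNPACK pragmas are omitted for brevity):

```haskell
{-# LANGUAGE MagicHash #-}
import GHC.Exts (ByteArray#)
import Foreign.ForeignPtr (ForeignPtr)
import Data.Word (Word8)

-- ByteString before this change: three fields, with the payload
-- living in pinned memory.
data BS3 = BS3 (ForeignPtr Word8)  -- pointer into pinned memory
               Int                 -- offset
               Int                 -- length

bs3Length :: BS3 -> Int
bs3Length (BS3 _ _ len) = len

-- ShortByteString: a single unpinned ByteArray#; its length lives in
-- the array header, so no separate field (and no fs_nchars) is needed.
data SBS1 = SBS1 ByteArray#
```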
```haskell
dataConIdentity :: DataCon -> ByteString
-- We want this string to be UTF-8, so we get the bytes directly from the FastStrings.
dataConIdentity dc = LBS.toStrict $ BSB.toLazyByteString $ mconcat
```

By default bytestring allocates 4 kB chunks. I'm fairly certain we could change the default to 512 bytes here and still get away with using a single chunk in 99.9% of cases.
Edited by Andreas Klebinger

The result of this function will likely be small, `GHC.Base.Int` or the like. However, the bytestring builder internally allocates 4 kB of memory to run in by default inside `toLazyByteString`. We can just set the default size for the initial chunk to e.g. 512 bytes instead, and it should save a fair bit of allocation. But if this code isn't very performance sensitive (I don't know if it is) then it should be ok either way.
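The suggestion above can be sketched with `toLazyByteStringWith` from `Data.ByteString.Builder.Extra` (the 512-byte figure is the one floated above; the function name `toLazyByteStringSmall` is made up for illustration):

```haskell
import qualified Data.ByteString.Builder as BSB
import qualified Data.ByteString.Builder.Extra as BSBE
import qualified Data.ByteString.Lazy as LBS

-- Run a Builder with a 512-byte initial buffer instead of the ~4 kB
-- default used by toLazyByteString; any later chunks grow as usual.
toLazyByteStringSmall :: BSB.Builder -> LBS.ByteString
toLazyByteStringSmall =
  BSBE.toLazyByteStringWith
    (BSBE.safeStrategy 512 BSBE.smallChunkSize)
    LBS.empty
```

As long as the builder output fits in 512 bytes, the whole result is a single small chunk, so the common case allocates far less than with the default strategy.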
Aaaaah, yes. I see the issue now. `withForeignPtr` doesn't provide the correct scoping here because the actual decoding happens when the thunk we return is forced! The old code had `touchForeignPtr` inside the thunk to deal with that.

Edited by Daniel Gröber (dxld)

added 126 commits
- f3612de1...795986aa - 124 commits from branch ghc:master
- 58cc5b11 - Remove length field from FastString
- 990f2327 - Use ShortByteString for FastString
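The `withForeignPtr` scoping pitfall described above can be sketched as follows (hypothetical `peekLater` names, not the actual FastString code):

```haskell
import Data.Word (Word8)
import Foreign.ForeignPtr
  (ForeignPtr, withForeignPtr, touchForeignPtr, mallocForeignPtrBytes)
import Foreign.Storable (peek, poke)
import System.IO.Unsafe (unsafePerformIO)

-- WRONG: the returned thunk only dereferences p when it is forced,
-- which happens after withForeignPtr's scope has ended, so the
-- memory may already have been finalized by then.
peekLater :: ForeignPtr Word8 -> IO Word8
peekLater fp = withForeignPtr fp $ \p -> pure (unsafePerformIO (peek p))

-- The old code's pattern: touchForeignPtr *inside* the thunk, after
-- the read, so the ForeignPtr stays alive until the thunk is forced.
peekLaterSafe :: ForeignPtr Word8 -> IO Word8
peekLaterSafe fp = withForeignPtr fp $ \p ->
  pure $ unsafePerformIO $ do
    w <- peek p
    touchForeignPtr fp
    pure w
```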
I built this branch and compared the performance of `FastString` on a microbenchmark: https://gist.github.com/248ffebee2cb79060413fc8c54da7037

With the patch vs. without the patch:
```
[nix-shell:~/ghc]$ ./FastStringBench +RTS -s
0 1 2 3 4 5
  29,075,714,656 bytes allocated in the heap
   4,374,332,568 bytes copied during GC
     227,497,416 bytes maximum residency (22 sample(s))
       1,739,800 bytes maximum slop
             216 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     27954 colls,     0 par    1.697s   1.695s     0.0001s    0.0008s
  Gen  1        22 colls,     0 par    1.382s   1.382s     0.0628s    0.1927s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    9.178s  (  9.180s elapsed)
  GC      time    3.080s  (  3.077s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   12.258s  ( 12.258s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    3,167,986,100 bytes per MUT second

  Productivity  74.9% of total user, 74.9% of total elapsed

[nix-shell:~/ghc]$ ghc FastStringBench.hs -fforce-recomp -package ghc
[1 of 1] Compiling Main             ( FastStringBench.hs, FastStringBench.o )
Linking FastStringBench ...
[nix-shell:~/ghc]$ ./FastStringBench +RTS -s
0 1 2 3 4 5
  25,332,708,016 bytes allocated in the heap
   4,504,791,880 bytes copied during GC
     272,247,280 bytes maximum residency (22 sample(s))
       1,022,144 bytes maximum slop
             259 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     24340 colls,     0 par    1.572s   1.566s     0.0001s    0.0003s
  Gen  1        22 colls,     0 par    1.302s   1.302s     0.0592s    0.2057s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   87.953s  ( 87.958s elapsed)
  GC      time    2.874s  (  2.868s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   90.828s  ( 90.827s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    288,024,213 bytes per MUT second

  Productivity  96.8% of total user, 96.8% of total elapsed
```
I just can't seem to figure out the reason for the CI stat failures. Here are the stat failures, all increases in the "bytes allocated" metric:
All from the x86_64-linux-deb9-dwarf baseline @ HEAD~2, metric `compile_time/bytes allocated`:

| Test | Expected | Tolerance | Actual | Deviation |
|---|---|---|---|---|
| T5837 (normal) | 56,072,736 | ±10% | 62,392,160 | 11.3 % |
| T12234 (optasm) | 88,864,056 | ±5% | 94,745,504 | 6.6 % |
| T13035 (normal) | 128,738,208 | ±5% | 136,245,056 | 5.8 % |
| T12150 (optasm) | 81,576,184 | ±5% | 87,360,184 | 7.1 % |
| Naperian (optasm) | 59,767,316 | ±10% | 66,892,328 | 11.9 % |
| T11303b (normal) | 57,709,692 | ±10% | 64,602,824 | 11.9 % |
I also tried compiling Cabal.

With `master`:

```
  50,855,728,776 bytes allocated in the heap
   4,134,339,512 bytes copied during GC
     378,525,600 bytes maximum residency (18 sample(s))
       2,619,488 bytes maximum slop
             816 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1418 colls,     0 par    5.010s   5.013s     0.0035s    0.0926s
  Gen  1        18 colls,     0 par    1.421s   1.422s     0.0790s    0.2721s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   28.604s  ( 33.689s elapsed)
  GC      time    6.431s  (  6.435s elapsed)
  EXIT    time    0.001s  (  0.006s elapsed)
  Total   time   35.036s  ( 40.130s elapsed)

  Alloc rate    1,777,950,299 bytes per MUT second

  Productivity  81.6% of total user, 83.9% of total elapsed
```
With short-bytestring:

```
  51,056,298,936 bytes allocated in the heap
   4,524,863,880 bytes copied during GC
     305,863,448 bytes maximum residency (21 sample(s))
       2,155,752 bytes maximum slop
             847 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1699 colls,     0 par    5.395s   5.397s     0.0032s    0.1066s
  Gen  1        21 colls,     0 par    1.620s   1.620s     0.0772s    0.2923s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.000s elapsed)
  MUT     time   28.493s  ( 33.571s elapsed)
  GC      time    7.015s  (  7.017s elapsed)
  EXIT    time    0.001s  (  0.001s elapsed)
  Total   time   35.508s  ( 40.590s elapsed)

  Alloc rate    1,791,918,128 bytes per MUT second

  Productivity  80.2% of total user, 82.7% of total elapsed
```
The "bytes allocated" metric also increases here; however, with short-bytestring the maximum residency is significantly lower, 305 MB vs. 378 MB (this is consistent across runs), so that's good.

One other thing that stands out: with short-bytestring there is quite a bit more "bytes copied during GC", which kind of makes sense. All the FastStrings' data used to be pinned and now isn't, so there is more potential for it to be copied around during GC. That makes me think: is this extra copying also counted towards "bytes allocated"? In my mental model you'd have to allocate the destination of the GC copy, which would increase the amount of allocation.

Does that sufficiently explain the stat increase, or am I way off?
Edited by Daniel Gröber (dxld)