Improve FastString memory behaviour
This PR changes the `FastString` implementation to use `ShortByteString` instead of `ByteString`, and removes the cache for `lengthFS`. It also pulls in the lazy z-encoding cache PR !1039 (closed).
There are multiple reasons we want this:
- Fewer allocations: `ByteString` has three fields while `ShortByteString` has just one, and removing `fs_nchars` removes another one.
- `ByteString` memory is pinned:
  - This can cause fragmentation issues (see for example #13110), and also
  - makes using FastStrings in compact regions impossible.
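For reference, the field-count difference looks roughly like this. These are simplified, illustrative types (the real definitions live in the `bytestring` package, and strictness/UNPACK pragmas are omitted for brevity):

```haskell
{-# LANGUAGE MagicHash #-}
import GHC.Exts (ByteArray#)
import Foreign.ForeignPtr (ForeignPtr)
import Data.Word (Word8)

-- ByteString before this change: three fields, with the payload
-- living in pinned memory.
data BS3 = BS3 (ForeignPtr Word8)  -- pointer into pinned memory
               Int                 -- offset
               Int                 -- length

bs3Length :: BS3 -> Int
bs3Length (BS3 _ _ len) = len

-- ShortByteString: a single unpinned ByteArray#; its length lives in
-- the array header, so no separate field (and no fs_nchars) is needed.
data SBS1 = SBS1 ByteArray#
```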
```haskell
dataConIdentity :: DataCon -> ByteString
-- We want this string to be UTF-8, so we get the bytes directly from the FastStrings.
dataConIdentity dc = LBS.toStrict $ BSB.toLazyByteString $ mconcat
```

By default bytestring allocates 4 kB chunks. I'm fairly certain we could change the default to 512 bytes here and still get away with using a single chunk in 99.9% of cases.
Edited by Andreas Klebinger

The result of this function will likely be small, `GHC.Base.Int` or the like. However, the bytestring builder internally allocates 4 kB of memory to run in by default inside `toLazyByteString`. We can just set the default size for the initial chunk to e.g. 512 bytes instead, and it should save a fair bit of allocation. But if this code isn't very performance sensitive (I don't know if it is) then it should be ok either way.
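The suggestion above can be sketched with `toLazyByteStringWith` from `Data.ByteString.Builder.Extra` (the 512-byte figure is the one floated above; the function name `toLazyByteStringSmall` is made up for illustration):

```haskell
import qualified Data.ByteString.Builder as BSB
import qualified Data.ByteString.Builder.Extra as BSBE
import qualified Data.ByteString.Lazy as LBS

-- Run a Builder with a 512-byte initial buffer instead of the ~4 kB
-- default used by toLazyByteString; any later chunks grow as usual.
toLazyByteStringSmall :: BSB.Builder -> LBS.ByteString
toLazyByteStringSmall =
  BSBE.toLazyByteStringWith
    (BSBE.safeStrategy 512 BSBE.smallChunkSize)
    LBS.empty
```

As long as the builder output fits in 512 bytes, the whole result is a single small chunk, so the common case allocates far less than with the default strategy.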
Aaaaah, yes. I see the issue now. `withForeignPtr` doesn't provide the correct scoping here because the actual decoding happens when the thunk we return is forced! The old code had `touchForeignPtr` inside the thunk to deal with that.

Edited by Daniel Gröber (dxld)

added 126 commits
- f3612de1...795986aa - 124 commits from branch ghc:master
- 58cc5b11 - Remove length field from FastString
- 990f2327 - Use ShortByteString for FastString
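The `withForeignPtr` scoping pitfall described above can be sketched as follows (hypothetical `peekLater` names, not the actual FastString code):

```haskell
import Data.Word (Word8)
import Foreign.ForeignPtr
  (ForeignPtr, withForeignPtr, touchForeignPtr, mallocForeignPtrBytes)
import Foreign.Storable (peek, poke)
import System.IO.Unsafe (unsafePerformIO)

-- WRONG: the returned thunk only dereferences p when it is forced,
-- which happens after withForeignPtr's scope has ended, so the
-- memory may already have been finalized by then.
peekLater :: ForeignPtr Word8 -> IO Word8
peekLater fp = withForeignPtr fp $ \p -> pure (unsafePerformIO (peek p))

-- The old code's pattern: touchForeignPtr *inside* the thunk, after
-- the read, so the ForeignPtr stays alive until the thunk is forced.
peekLaterSafe :: ForeignPtr Word8 -> IO Word8
peekLaterSafe fp = withForeignPtr fp $ \p ->
  pure $ unsafePerformIO $ do
    w <- peek p
    touchForeignPtr fp
    pure w
```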
I built this branch and compared the performance of `FastString` on a microbenchmark: https://gist.github.com/248ffebee2cb79060413fc8c54da7037

With the patch vs. without the patch:
```
[nix-shell:~/ghc]$ ./FastStringBench +RTS -s
0 1 2 3 4 5
  29,075,714,656 bytes allocated in the heap
   4,374,332,568 bytes copied during GC
     227,497,416 bytes maximum residency (22 sample(s))
       1,739,800 bytes maximum slop
             216 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     27954 colls,     0 par    1.697s   1.695s     0.0001s    0.0008s
  Gen  1        22 colls,     0 par    1.382s   1.382s     0.0628s    0.1927s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    9.178s  (  9.180s elapsed)
  GC      time    3.080s  (  3.077s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   12.258s  ( 12.258s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    3,167,986,100 bytes per MUT second

  Productivity  74.9% of total user, 74.9% of total elapsed

[nix-shell:~/ghc]$ ghc FastStringBench.hs -fforce-recomp -package ghc
[1 of 1] Compiling Main             ( FastStringBench.hs, FastStringBench.o )
Linking FastStringBench ...
[nix-shell:~/ghc]$ ./FastStringBench +RTS -s
0 1 2 3 4 5
  25,332,708,016 bytes allocated in the heap
   4,504,791,880 bytes copied during GC
     272,247,280 bytes maximum residency (22 sample(s))
       1,022,144 bytes maximum slop
             259 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     24340 colls,     0 par    1.572s   1.566s     0.0001s    0.0003s
  Gen  1        22 colls,     0 par    1.302s   1.302s     0.0592s    0.2057s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   87.953s  ( 87.958s elapsed)
  GC      time    2.874s  (  2.868s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time   90.828s  ( 90.827s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    288,024,213 bytes per MUT second

  Productivity  96.8% of total user, 96.8% of total elapsed
```
I just can't seem to figure out the reason for the CI stat failures. Here are the stat failures, all increases in the "bytes allocated" metric:
All from the x86_64-linux-deb9-dwarf baseline @ HEAD~2, metric `compile_time/bytes allocated`:

| Test | Expected | Tolerance | Actual | Deviation |
|---|---|---|---|---|
| T5837 (normal) | 56,072,736 | ±10% | 62,392,160 | 11.3 % |
| T12234 (optasm) | 88,864,056 | ±5% | 94,745,504 | 6.6 % |
| T13035 (normal) | 128,738,208 | ±5% | 136,245,056 | 5.8 % |
| T12150 (optasm) | 81,576,184 | ±5% | 87,360,184 | 7.1 % |
| Naperian (optasm) | 59,767,316 | ±10% | 66,892,328 | 11.9 % |
| T11303b (normal) | 57,709,692 | ±10% | 64,602,824 | 11.9 % |
I also tried compiling Cabal.

With `master`:

```
  50,855,728,776 bytes allocated in the heap
   4,134,339,512 bytes copied during GC
     378,525,600 bytes maximum residency (18 sample(s))
       2,619,488 bytes maximum slop
             816 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1418 colls,     0 par    5.010s   5.013s     0.0035s    0.0926s
  Gen  1        18 colls,     0 par    1.421s   1.422s     0.0790s    0.2721s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time   28.604s  ( 33.689s elapsed)
  GC      time    6.431s  (  6.435s elapsed)
  EXIT    time    0.001s  (  0.006s elapsed)
  Total   time   35.036s  ( 40.130s elapsed)

  Alloc rate    1,777,950,299 bytes per MUT second

  Productivity  81.6% of total user, 83.9% of total elapsed
```
With short-bytestring:

```
  51,056,298,936 bytes allocated in the heap
   4,524,863,880 bytes copied during GC
     305,863,448 bytes maximum residency (21 sample(s))
       2,155,752 bytes maximum slop
             847 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1699 colls,     0 par    5.395s   5.397s     0.0032s    0.1066s
  Gen  1        21 colls,     0 par    1.620s   1.620s     0.0772s    0.2923s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.001s  (  0.000s elapsed)
  MUT     time   28.493s  ( 33.571s elapsed)
  GC      time    7.015s  (  7.017s elapsed)
  EXIT    time    0.001s  (  0.001s elapsed)
  Total   time   35.508s  ( 40.590s elapsed)

  Alloc rate    1,791,918,128 bytes per MUT second

  Productivity  80.2% of total user, 82.7% of total elapsed
```
The "bytes allocated" metric also increases here; however, with short-bytestring the maximum residency is significantly lower, 305 MB vs. 378 MB (this is consistent across runs), so that's good.

One other thing that stands out: with short-bytestring there is quite a bit more "bytes copied during GC", which kind of makes sense. All the FastStrings' data used to be pinned and now isn't, so there is more potential for it to be copied around during GC. That makes me think: is this extra copying also counted towards "bytes allocated"? In my mental model you'd have to allocate the destination of the GC copy, which would increase the amount of allocation.

Does that sufficiently explain the stat increase, or am I way off?
Edited by Daniel Gröber (dxld)