Skip to content
Snippets Groups Projects

Improve FastString memory behaviour

Closed Daniel Gröber (dxld) requested to merge DanielG/ghc:short-bytestring into master

This PR changes the FastString implemetation to use ShortByteString instead of ByteString and removes the cache for lengthFS. It also pulls in the lazy z-encoding cache PR !1039 (closed).

There are multiple reasons we want this:

  • Fewe allocations: ByteString has 3 fields, ShortByteString just has one and removing fs_nchars removes another one.
  • ByteString memory is pinned:
    • This can cause fragmentation issues (see for example #13110) but also
    • makes using FastStrings in compact regions impossible.

cc @mpickering @bgamari

Edited by Daniel Gröber (dxld)

Merge request reports

Merge request pipeline #22496 passed with warnings

Merge request pipeline passed with warnings for 27dfcc8e

Closed by Marge BotMarge Bot 4 years ago (Jul 23, 2020 12:18am UTC)

Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
1358 1358 dataConIdentity :: DataCon -> ByteString
1359 1359 -- We want this string to be UTF-8, so we get the bytes directly from the FastStrings.
1360 1360 dataConIdentity dc = LBS.toStrict $ BSB.toLazyByteString $ mconcat
  • Someone needs to debug the very confusing CI error..

  • Ben Gamari marked as a Work In Progress

    marked as a Work In Progress

  • Ben Gamari changed the description

    changed the description

  • @DanielG

    I managed to fix the issue with the build - see 949a0eac

    This doesn't seem like the best way but at least the problem has been identified.

  • Aaaaah, yes. I see the issue now.

    withForeignPtr doesn't provide the correct scoping here because the actual decoding happens when the thunk we return is forced! The old code had touchForeignPtr inside the thunk to deal with that.

    Edited by Daniel Gröber (dxld)
  • added 1 commit

    • 8c7c0468 - Use ShortByteString for FastString

    Compare with previous version

  • added 1 commit

    • f3612de1 - Use ShortByteString for FastString

    Compare with previous version

  • added 126 commits

    Compare with previous version

  • Changes:

    • Rebase on master
    • Fix fastStringToByteString deprecation warnings
  • Daniel Gröber (dxld) unmarked as a Work In Progress

    unmarked as a Work In Progress

  • I built this branch and compared the performance of FastString on a microbenchmark.

    https://gist.github.com/248ffebee2cb79060413fc8c54da7037

    With patch vs after patch

    [nix-shell:~/ghc]$ ./FastStringBench +RTS -s
    0
    1
    2
    3
    4
    5
      29,075,714,656 bytes allocated in the heap
       4,374,332,568 bytes copied during GC
         227,497,416 bytes maximum residency (22 sample(s))
           1,739,800 bytes maximum slop
                 216 MB total memory in use (0 MB lost due to fragmentation)
    
                                         Tot time (elapsed)  Avg pause  Max pause
      Gen  0     27954 colls,     0 par    1.697s   1.695s     0.0001s    0.0008s
      Gen  1        22 colls,     0 par    1.382s   1.382s     0.0628s    0.1927s
    
      INIT    time    0.000s  (  0.000s elapsed)
      MUT     time    9.178s  (  9.180s elapsed)
      GC      time    3.080s  (  3.077s elapsed)
      EXIT    time    0.000s  (  0.000s elapsed)
      Total   time   12.258s  ( 12.258s elapsed)
    
      %GC     time       0.0%  (0.0% elapsed)
    
      Alloc rate    3,167,986,100 bytes per MUT second
    
      Productivity  74.9% of total user, 74.9% of total elapsed
    
    
    [nix-shell:~/ghc]$ ghc FastStringBench.hs -fforce-recomp -package ghc
    [1 of 1] Compiling Main             ( FastStringBench.hs, FastStringBench.o )
    Linking FastStringBench ...
    
    [nix-shell:~/ghc]$ ./FastStringBench +RTS -s
    0
    1
    2
    3
    4
    5
      25,332,708,016 bytes allocated in the heap
       4,504,791,880 bytes copied during GC
         272,247,280 bytes maximum residency (22 sample(s))
           1,022,144 bytes maximum slop
                 259 MB total memory in use (0 MB lost due to fragmentation)
    
                                         Tot time (elapsed)  Avg pause  Max pause
      Gen  0     24340 colls,     0 par    1.572s   1.566s     0.0001s    0.0003s
      Gen  1        22 colls,     0 par    1.302s   1.302s     0.0592s    0.2057s
    
      INIT    time    0.000s  (  0.000s elapsed)
      MUT     time   87.953s  ( 87.958s elapsed)
      GC      time    2.874s  (  2.868s elapsed)
      EXIT    time    0.000s  (  0.000s elapsed)
      Total   time   90.828s  ( 90.827s elapsed)
    
      %GC     time       0.0%  (0.0% elapsed)
    
      Alloc rate    288,024,213 bytes per MUT second
    
      Productivity  96.8% of total user, 96.8% of total elapsed
  • added 1 commit

    • d3d3ebdc - Use ShortByteString for FastString

    Compare with previous version

  • Changes:

    • Implement countUTF8Chars in terms of ShortByteString not via ByteString conversion
  • added 1 commit

    • d648cc13 - Use ShortByteString for FastString

    Compare with previous version

  • Changes:

    • Fix some warnings
  • I just can't seem to figure out the reason for the CI stat failures.

    Here's the stat failures, all increases in the "bytes allocated" metric:

    compile_time/bytes allocated Increase from x86_64-linux-deb9-dwarf baseline @ HEAD~2:
        Expected    T5837 (normal) compile_time/bytes allocated: 56072736.0 +/-10%
        Lower bound T5837 (normal) compile_time/bytes allocated:   50465462 
        Upper bound T5837 (normal) compile_time/bytes allocated:   61680010 
        Actual      T5837 (normal) compile_time/bytes allocated:   62392160 
        Deviation   T5837 (normal) compile_time/bytes allocated:       11.3 %
    
    compile_time/bytes allocated Increase from x86_64-linux-deb9-dwarf baseline @ HEAD~2:
        Expected    T12234 (optasm) compile_time/bytes allocated: 88864056.0 +/-5%
        Lower bound T12234 (optasm) compile_time/bytes allocated:   84420853 
        Upper bound T12234 (optasm) compile_time/bytes allocated:   93307259 
        Actual      T12234 (optasm) compile_time/bytes allocated:   94745504 
        Deviation   T12234 (optasm) compile_time/bytes allocated:        6.6 %
    
    compile_time/bytes allocated Increase from x86_64-linux-deb9-dwarf baseline @ HEAD~2:
        Expected    T13035 (normal) compile_time/bytes allocated: 128738208.0 +/-5%
        Lower bound T13035 (normal) compile_time/bytes allocated:   122301297 
        Upper bound T13035 (normal) compile_time/bytes allocated:   135175119 
        Actual      T13035 (normal) compile_time/bytes allocated:   136245056 
        Deviation   T13035 (normal) compile_time/bytes allocated:         5.8 %
    
    compile_time/bytes allocated Increase from x86_64-linux-deb9-dwarf baseline @ HEAD~2:
        Expected    T12150 (optasm) compile_time/bytes allocated: 81576184.0 +/-5%
        Lower bound T12150 (optasm) compile_time/bytes allocated:   77497374 
        Upper bound T12150 (optasm) compile_time/bytes allocated:   85654994 
        Actual      T12150 (optasm) compile_time/bytes allocated:   87360184 
        Deviation   T12150 (optasm) compile_time/bytes allocated:        7.1 %
    
    compile_time/bytes allocated Increase from x86_64-linux-deb9-dwarf baseline @ HEAD~2:
        Expected    Naperian (optasm) compile_time/bytes allocated: 59767316.0 +/-10%
        Lower bound Naperian (optasm) compile_time/bytes allocated:   53790584 
        Upper bound Naperian (optasm) compile_time/bytes allocated:   65744048 
        Actual      Naperian (optasm) compile_time/bytes allocated:   66892328 
        Deviation   Naperian (optasm) compile_time/bytes allocated:       11.9 %
    
    compile_time/bytes allocated Increase from x86_64-linux-deb9-dwarf baseline @ HEAD~2:
        Expected    T11303b (normal) compile_time/bytes allocated: 57709692.0 +/-10%
        Lower bound T11303b (normal) compile_time/bytes allocated:   51938722 
        Upper bound T11303b (normal) compile_time/bytes allocated:   63480662 
        Actual      T11303b (normal) compile_time/bytes allocated:   64602824 
        Deviation   T11303b (normal) compile_time/bytes allocated:       11.9 %

    I also tried compiling Cabal

    master

      50,855,728,776 bytes allocated in the heap
       4,134,339,512 bytes copied during GC
         378,525,600 bytes maximum residency (18 sample(s))
           2,619,488 bytes maximum slop
                 816 MiB total memory in use (0 MB lost due to fragmentation)
    
                                         Tot time (elapsed)  Avg pause  Max pause
      Gen  0      1418 colls,     0 par    5.010s   5.013s     0.0035s    0.0926s
      Gen  1        18 colls,     0 par    1.421s   1.422s     0.0790s    0.2721s
    
      TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
    
      SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
    
      INIT    time    0.000s  (  0.000s elapsed)
      MUT     time   28.604s  ( 33.689s elapsed)
      GC      time    6.431s  (  6.435s elapsed)
      EXIT    time    0.001s  (  0.006s elapsed)
      Total   time   35.036s  ( 40.130s elapsed)
    
      Alloc rate    1,777,950,299 bytes per MUT second
    
      Productivity  81.6% of total user, 83.9% of total elapsed

    with short-bytestring

      51,056,298,936 bytes allocated in the heap
       4,524,863,880 bytes copied during GC
         305,863,448 bytes maximum residency (21 sample(s))
           2,155,752 bytes maximum slop
                 847 MiB total memory in use (0 MB lost due to fragmentation)
    
                                         Tot time (elapsed)  Avg pause  Max pause
      Gen  0      1699 colls,     0 par    5.395s   5.397s     0.0032s    0.1066s
      Gen  1        21 colls,     0 par    1.620s   1.620s     0.0772s    0.2923s
    
      TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
    
      SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
    
      INIT    time    0.001s  (  0.000s elapsed)
      MUT     time   28.493s  ( 33.571s elapsed)
      GC      time    7.015s  (  7.017s elapsed)
      EXIT    time    0.001s  (  0.001s elapsed)
      Total   time   35.508s  ( 40.590s elapsed)
    
      Alloc rate    1,791,918,128 bytes per MUT second
    
      Productivity  80.2% of total user, 82.7% of total elapsed

    The "bytes allocated" metric also increases here, however with short-bytestring I have significantly less maximum residency 305M vs 378M (this is consistent across runs) so that's good.

    One other thing that stands out is with short-bytestring I have quite a bit more "bytes copied during GC" which kind of makes sense, all the FastString's data used to pinned and now isn't so there would be more potential for them to be copied around during GC. That makes me think, is this extra stuff being copied around also counting towards "bytes allocated"? As in you'd have to allocate the destination of the GC copy so that would increase the ammount of allocations in my mental model.

    Does that sufficiently explain the stat increase or am I way off?

    Edited by Daniel Gröber (dxld)
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Please register or sign in to reply
    Loading