Skip to content

GHC GC rises greatly on -N8 compared to -N7

I've recently been benchmarking some parallel code on an 8-core x86 machine. I have graphed performance for -N1 through to -N8, but I often get a huge spike on -N8. I can produce this fairly reliably with the attached program, that simply spawns 8 threads that each spin writing to a TVar. Here is an example of the timings I get:

time ./Main +RTS -N1
real	0m43.697s
user	0m43.683s
sys	0m0.016s

time ./Main +RTS -N4
real	0m24.424s
user	1m13.621s
sys	0m0.208s

time ./Main +RTS -N7
real	0m18.631s
user	1m42.778s
sys	0m0.260s

time ./Main +RTS -N8
real	0m37.859s
user	3m38.506s
sys	0m0.632s

Spot the odd one out! It's not to do with having exactly 8 threads for 8 cores -- I get this problem with ~150 threads on 8 cores. When I run with -s, it seems like the GC is the reason:

time ./Main +RTS -N7 -s 
   7,313,502,412 bytes allocated in the heap
     108,812,664 bytes copied during GC
          11,924 bytes maximum residency (100 sample(s))
          61,828 bytes maximum slop
               5 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:  4207 collections,     0 parallel,  1.18s,  1.17s elapsed
  Generation 1:   100 collections,    99 parallel,  0.18s,  0.04s elapsed

  Parallel GC work balance: 3.25 (285256 / 87661, ideal 7)

  Task  0 (worker) :  MUT time:   5.56s  ( 11.70s elapsed)
                      GC  time:   0.01s  (  0.01s elapsed)

  Task  1 (worker) :  MUT time:   6.65s  ( 11.70s elapsed)
                      GC  time:   1.16s  (  1.13s elapsed)

  Task  2 (worker) :  MUT time:   0.03s  ( 11.70s elapsed)
                      GC  time:   0.00s  (  0.00s elapsed)

  Task  3 (worker) :  MUT time:   0.00s  ( 11.70s elapsed)
                      GC  time:   0.00s  (  0.00s elapsed)

  Task  4 (worker) :  MUT time:   9.93s  ( 11.70s elapsed)
                      GC  time:   0.01s  (  0.01s elapsed)

  Task  5 (worker) :  MUT time:   6.97s  ( 11.70s elapsed)
                      GC  time:   0.00s  (  0.00s elapsed)

  Task  6 (worker) :  MUT time:   8.25s  ( 11.70s elapsed)
                      GC  time:   0.04s  (  0.02s elapsed)

  Task  7 (worker) :  MUT time:   6.79s  ( 11.72s elapsed)
                      GC  time:   0.00s  (  0.00s elapsed)

  Task  8 (worker) :  MUT time:   3.70s  ( 11.70s elapsed)
                      GC  time:   0.14s  (  0.03s elapsed)

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time   61.91s  ( 11.70s elapsed)
  GC    time    1.35s  (  1.21s elapsed)
  EXIT  time    0.01s  (  0.02s elapsed)
  Total time   63.27s  ( 12.93s elapsed)

  %GC time       2.1%  (9.4% elapsed)

  Alloc rate    118,112,344 bytes per MUT second

  Productivity  97.9% of total user, 478.9% of total elapsed

recordMutableGen_sync: 383
gc_alloc_block_sync: 2
whitehole_spin: 0
gen[0].steps[0].sync_todo: 0
gen[0].steps[0].sync_large_objects: 0
gen[0].steps[1].sync_todo: 16
gen[0].steps[1].sync_large_objects: 0
gen[1].steps[0].sync_todo: 175
gen[1].steps[0].sync_large_objects: 0

real	0m12.933s
user	1m3.276s
sys	0m0.248s
time ./Main +RTS -N8 -s 
   7,316,581,044 bytes allocated in the heap
     153,424,584 bytes copied during GC
          11,964 bytes maximum residency (132 sample(s))
          65,916 bytes maximum slop
               6 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:  4711 collections,  4711 parallel, 136.08s, 21.38s elapsed
  Generation 1:   132 collections,   132 parallel,  6.12s,  0.93s elapsed

  Parallel GC work balance: 1.01 (38276826 / 38012855, ideal 8)

  Task  0 (worker) :  MUT time:   0.00s  ( 14.56s elapsed)
                      GC  time:  30.31s  (  4.74s elapsed)

  Task  1 (worker) :  MUT time:   0.00s  ( 14.56s elapsed)
                      GC  time:  25.09s  (  3.68s elapsed)

  Task  2 (worker) :  MUT time:   0.00s  ( 14.56s elapsed)
                      GC  time:   0.09s  (  0.01s elapsed)

  Task  3 (worker) :  MUT time:   0.00s  ( 14.56s elapsed)
                      GC  time:   0.00s  (  0.00s elapsed)

  Task  4 (worker) :  MUT time:   0.98s  ( 14.56s elapsed)
                      GC  time:   7.62s  (  1.17s elapsed)

  Task  5 (worker) :  MUT time:   0.00s  ( 14.56s elapsed)
                      GC  time:  13.09s  (  2.01s elapsed)

  Task  6 (worker) :  MUT time:   5.81s  ( 14.56s elapsed)
                      GC  time:   1.92s  (  0.28s elapsed)

  Task  7 (worker) :  MUT time:   0.00s  ( 14.56s elapsed)
                      GC  time:  25.12s  (  4.24s elapsed)

  Task  8 (worker) :  MUT time:   0.95s  ( 14.56s elapsed)
                      GC  time:   8.01s  (  1.20s elapsed)

  Task  9 (worker) :  MUT time:   0.00s  ( 14.56s elapsed)
                      GC  time:  30.94s  (  4.98s elapsed)

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time   65.37s  ( 14.56s elapsed)
  GC    time  142.20s  ( 22.31s elapsed)
  EXIT  time    0.00s  (  0.01s elapsed)
  Total time  207.57s  ( 36.87s elapsed)

  %GC time      68.5%  (60.5% elapsed)

  Alloc rate    111,922,106 bytes per MUT second

  Productivity  31.5% of total user, 177.3% of total elapsed

recordMutableGen_sync: 597
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].steps[0].sync_todo: 0
gen[0].steps[0].sync_large_objects: 0
gen[0].steps[1].sync_todo: 35296
gen[0].steps[1].sync_large_objects: 0
gen[1].steps[0].sync_todo: 557515
gen[1].steps[0].sync_large_objects: 0

real	0m36.873s
user	3m27.569s
sys	0m0.524s

See how the GC goes from 1s to 142s, just by virtue of -N8 instead of -N7. Increasing the heap size seems to alleviate the problem, but the difference is so severe (and unexpected, at least to me) that I thought I'd file a ticket. I am happy to try other flags if you want. This is all on GHC 6.10.4.

Trac metadata
Trac field Value
Version 6.10.4
Type Bug
TypeOfFailure OtherFailure
Priority normal
Resolution Unresolved
Component Runtime System
Test case
Differential revisions
BlockedBy
Related
Blocking
CC
Operating system
Architecture
To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information