
`+RTS --numa` can significantly degrade mutator performance

While validating some training materials I noticed that `+RTS --numa` can hurt performance significantly more than it helps. For instance, using my https://gitlab.haskell.org/bgamari/raytrace workload with GHC 9.8.1 on an AMD 2990WX (32 cores, 64 threads), I find the following:

$ ray +RTS -s -N16 --numa
Rays: 16777216
Figures: 10559
Sampling...
1,179,032,039,144 bytes allocated in the heap
     438,388,616 bytes copied during GC
      21,102,328 bytes maximum residency (11 sample(s))
       1,413,384 bytes maximum slop
             141 MiB total memory in use (24 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     21674 colls, 21674 par   17.358s   2.010s     0.0001s    0.0022s
  Gen  1        11 colls,    10 par    0.096s   0.019s     0.0018s    0.0037s

  Parallel GC work balance: 63.91% (serial 0%, perfect 100%)

  TASKS: 34 (1 bound, 33 peak workers (33 total), using -N16)

  SPARKS: 256 (256 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.007s  (  0.003s elapsed)
  MUT     time  642.288s  ( 41.830s elapsed)
  GC      time   17.454s  (  2.029s elapsed)
  EXIT    time    0.012s  (  0.002s elapsed)
  Total   time  659.761s  ( 43.864s elapsed)

  Alloc rate    1,835,673,759 bytes per MUT second

  Productivity  97.4% of total user, 95.4% of total elapsed

$ ray +RTS -s -N16
Rays: 16777216
Figures: 10559
Sampling...
1,179,032,005,208 bytes allocated in the heap
     450,982,840 bytes copied during GC
      20,707,696 bytes maximum residency (11 sample(s))
       1,382,032 bytes maximum slop
             140 MiB total memory in use (21 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     22545 colls, 22545 par   18.835s   2.214s     0.0001s    0.0015s
  Gen  1        11 colls,    10 par    0.109s   0.020s     0.0018s    0.0036s

  Parallel GC work balance: 63.71% (serial 0%, perfect 100%)

  TASKS: 34 (1 bound, 33 peak workers (33 total), using -N16)

  SPARKS: 256 (256 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.007s  (  0.003s elapsed)
  MUT     time  302.602s  ( 20.498s elapsed)
  GC      time   18.944s  (  2.234s elapsed)
  EXIT    time    0.015s  (  0.002s elapsed)
  Total   time  321.567s  ( 22.737s elapsed)

  Alloc rate    3,896,317,999 bytes per MUT second

  Productivity  94.1% of total user, 90.2% of total elapsed

So, while GC time is helped slightly, mutator time roughly doubles (41.8 s vs. 20.5 s elapsed), halving throughput.
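For concreteness, the slowdown can be read directly off the elapsed MUT times in the two `+RTS -s` reports above:

```shell
# Ratio of elapsed mutator time with --numa to without, from the logs above.
awk 'BEGIN { printf "%.2fx\n", 41.830 / 20.498 }'
```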

This machine's 64 GB of physical memory is spread across four DIMMs and evenly split between the machine's two memory controllers, so only two of the four NUMA nodes (0 and 2) have local memory:

$ sudo numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
node 0 size: 32081 MB
node 0 free: 19003 MB
node 1 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
node 2 size: 32190 MB
node 2 free: 11339 MB
node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node   0   1   2   3 
  0:  10  16  16  16 
  1:  16  10  16  16 
  2:  16  16  10  16 
  3:  16  16  16  10 
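One untested guess: since nodes 1 and 3 carry no local memory, `--numa` with no argument may be pinning capabilities and allocation to memoryless nodes. The RTS also accepts `--numa=<mask>`, where bit i of the mask selects node i, which would restrict it to the memory-bearing nodes:

```shell
# Select only the memory-bearing nodes 0 and 2: bit i enables node i,
# so nodes {0,2} -> 0b0101 = 5.
mask=$(( (1 << 0) | (1 << 2) ))
echo "ray +RTS -s -N16 --numa=$mask"
```

Whether this recovers the non-NUMA mutator time on this machine remains to be measured.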
Edited by Ben Gamari