`+RTS --numa` can significantly degrade mutator performance
While validating some training materials I noticed that `+RTS --numa` can hurt performance significantly more than it helps. For instance, with my https://gitlab.haskell.org/bgamari/raytrace workload on GHC 9.8.1 running on an AMD 2990WX (32 cores, 64 threads), I see the following:
```
$ ray +RTS -s -N16 --numa
Rays: 16777216
Figures: 10559
Sampling...
   1,179,032,039,144 bytes allocated in the heap
         438,388,616 bytes copied during GC
          21,102,328 bytes maximum residency (11 sample(s))
           1,413,384 bytes maximum slop
                 141 MiB total memory in use (24 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     21674 colls, 21674 par   17.358s    2.010s    0.0001s    0.0022s
  Gen  1        11 colls,    10 par    0.096s    0.019s    0.0018s    0.0037s

  Parallel GC work balance: 63.91% (serial 0%, perfect 100%)

  TASKS: 34 (1 bound, 33 peak workers (33 total), using -N16)

  SPARKS: 256 (256 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.007s  (  0.003s elapsed)
  MUT     time  642.288s  ( 41.830s elapsed)
  GC      time   17.454s  (  2.029s elapsed)
  EXIT    time    0.012s  (  0.002s elapsed)
  Total   time  659.761s  ( 43.864s elapsed)

  Alloc rate    1,835,673,759 bytes per MUT second

  Productivity  97.4% of total user, 95.4% of total elapsed
```
```
$ ray +RTS -s -N16
Rays: 16777216
Figures: 10559
Sampling...
   1,179,032,005,208 bytes allocated in the heap
         450,982,840 bytes copied during GC
          20,707,696 bytes maximum residency (11 sample(s))
           1,382,032 bytes maximum slop
                 140 MiB total memory in use (21 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     22545 colls, 22545 par   18.835s    2.214s    0.0001s    0.0015s
  Gen  1        11 colls,    10 par    0.109s    0.020s    0.0018s    0.0036s

  Parallel GC work balance: 63.71% (serial 0%, perfect 100%)

  TASKS: 34 (1 bound, 33 peak workers (33 total), using -N16)

  SPARKS: 256 (256 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.007s  (  0.003s elapsed)
  MUT     time  302.602s  ( 20.498s elapsed)
  GC      time   18.944s  (  2.234s elapsed)
  EXIT    time    0.015s  (  0.002s elapsed)
  Total   time  321.567s  ( 22.737s elapsed)

  Alloc rate    3,896,317,999 bytes per MUT second

  Productivity  94.1% of total user, 90.2% of total elapsed
```
So, while GC time is helped slightly, mutator time roughly doubles with `--numa` (642 s vs. 303 s of CPU time; 41.8 s vs. 20.5 s elapsed).
This machine's 64 GiB of physical memory is spread across four DIMMs and evenly split between the machine's two memory controllers. Notably, only two of the four NUMA nodes (0 and 2) have any local memory; nodes 1 and 3 are memoryless:
```
$ sudo numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
node 0 size: 32081 MB
node 0 free: 19003 MB
node 1 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
node 2 size: 32190 MB
node 2 free: 11339 MB
node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
node 3 size: 0 MB
node 3 free: 0 MB
node distances:
node   0   1   2   3
  0:  10  16  16  16
  1:  16  10  16  16
  2:  16  16  10  16
  3:  16  16  16  10
```
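One experiment that may be worth trying (a guess on my part; I have not yet confirmed that the memoryless nodes are the culprit): the RTS accepts a node mask, `--numa=<mask>`, which is a bitmap of the NUMA nodes the RTS is allowed to use. Restricting it to the two memory-bearing nodes on this machine would look something like the following sketch, where `ray` is the workload above:

```shell
# Nodes 0 and 2 are the only ones with local memory on this machine,
# so build a bitmask selecting just those nodes: bit 0 | bit 2 = 5.
mask=$(( (1 << 0) | (1 << 2) ))
echo "$mask"

# Hypothetical re-run restricted to the memory-bearing nodes:
# ray +RTS -s -N16 --numa=$mask
```

If the regression comes from the RTS trying to allocate on nodes 1 and 3, this run should look closer to the plain `-N16` numbers.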