rts/Inlines.c · 9e5ea67e268be2659cd30ebaed7044d298198ab0 · Glasgow Haskell Compiler / GHC

Simon Marlow authored Apr 23, 2016
Summary:
The aim here is to reduce the number of remote memory accesses on
systems with a NUMA memory architecture, typically multi-socket servers.

Linux provides a NUMA API for doing two things:
* Allocating memory local to a particular node
* Binding a thread to a particular node

When given the +RTS --numa flag, the runtime will
* Determine the number of NUMA nodes (N) by querying the OS
* Assign capabilities to nodes, so cap C is on node C%N
* Bind worker threads on a capability to the correct node
* Keep a separate free lists in the block layer for each node
* Allocate the nursery for a capability from node-local memory
* Allocate blocks in the GC from node-local memory

For example, using nofib/parallel/queens on a 24-core 2-socket machine:

```
$ ./Main 15 +RTS -N24 -s -A64m
  Total   time  173.960s  (  7.467s elapsed)

$ ./Main 15 +RTS -N24 -s -A64m --numa
  Total   time  150.836s  (  6.423s elapsed)
```

The biggest win here is expected to be allocating from node-local
memory, so that means programs using a large -A value (as here).

According to perf, on this program the number of remote memory accesses
were reduced by more than 50% by using `--numa`.

Test Plan:
* validate
* There's a new flag --debug-numa=<n> that pretends to do NUMA without
  actually making the OS calls, which is useful for testing the code
  on non-NUMA systems.
* TODO: I need to add some unit tests

Reviewers: erikd, austin, rwbarton, ezyang, bgamari, hvr, niteria

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D2199
9e5ea67e