    NUMA support · 9e5ea67e
    Simon Marlow authored
    The aim here is to reduce the number of remote memory accesses on
    systems with a NUMA memory architecture, typically multi-socket servers.
    Linux provides a NUMA API for doing two things:
    * Allocating memory local to a particular node
    * Binding a thread to a particular node
    When given the +RTS --numa flag, the runtime will:
    * Determine the number of NUMA nodes (N) by querying the OS
    * Assign capabilities to nodes, so cap C is on node C%N
    * Bind worker threads on a capability to the correct node
    * Keep separate free lists in the block layer for each node
    * Allocate the nursery for a capability from node-local memory
    * Allocate blocks in the GC from node-local memory
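    The capability-to-node assignment above (cap C on node C%N) can be
    sketched as follows. This is only an illustration: `cap_to_node` is a
    hypothetical name, not a function taken from the RTS source, and the
    node count is assumed to have been obtained from the OS already.

```c
/* Hypothetical helper (not the RTS's actual function): assign capability
 * number `cap` to a NUMA node round-robin, i.e. node = cap % n_nodes.
 * `n_nodes` is assumed to be the node count reported by the OS (N > 0). */
static unsigned cap_to_node(unsigned cap, unsigned n_nodes)
{
    return cap % n_nodes;
}
```

    With N = 2, for instance, even-numbered capabilities land on node 0 and
    odd-numbered ones on node 1, so consecutive capabilities alternate
    between nodes.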
    For example, using nofib/parallel/queens on a 24-core 2-socket machine:
    $ ./Main 15 +RTS -N24 -s -A64m
      Total   time  173.960s  (  7.467s elapsed)
    $ ./Main 15 +RTS -N24 -s -A64m --numa
      Total   time  150.836s  (  6.423s elapsed)
    The biggest win is expected to come from allocating from node-local
    memory, so programs that use a large -A value (as here) should benefit
    most. According to perf, using `--numa` reduced the number of remote
    memory accesses in this program by more than 50%.
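    To put the timings above in relative terms, the reduction can be
    computed as (before - after) / before. The helper below is a small
    illustrative sketch; `percent_reduction` is a name made up here, and
    the inputs are the figures quoted in the two runs above.

```c
/* Illustrative helper (hypothetical name): relative reduction in percent
 * when a measurement drops from `before` to `after`. */
static double percent_reduction(double before, double after)
{
    return 100.0 * (before - after) / before;
}

/* Applied to the runs above:
 *   total time:   percent_reduction(173.960, 150.836)  -> about 13.3
 *   elapsed time: percent_reduction(7.467, 6.423)      -> about 14.0
 */
```

    So on this benchmark `--numa` cuts both total and elapsed time by
    roughly 13-14%.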
    Test Plan:
    * validate
    * There's a new flag --debug-numa=<n> that pretends to do NUMA without
      actually making the OS calls, which is useful for testing the code
      on non-NUMA systems.
    * TODO: I need to add some unit tests
    Reviewers: erikd, austin, rwbarton, ezyang, bgamari, hvr, niteria
    Subscribers: thomie
    Differential Revision: https://phabricator.haskell.org/D2199