  • NUMA support · 9e5ea67e
    Simon Marlow authored
    Summary:
    The aim here is to reduce the number of remote memory accesses on
    systems with a NUMA memory architecture, typically multi-socket servers.
    
    Linux provides a NUMA API for doing two things:
    * Allocating memory local to a particular node
    * Binding a thread to a particular node
    
    When given the +RTS --numa flag, the runtime will:
    * Determine the number of NUMA nodes (N) by querying the OS
    * Assign capabilities to nodes, so cap C is on node C%N
    * Bind worker threads on a capability to the correct node
    * Keep separate free lists in the block layer for each node
    * Allocate the nursery for a capability from node-local memory
    * Allocate blocks in the GC from node-local memory
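
The capability-to-node rule above ("cap C is on node C%N") can be sketched in a few lines of C. This is only an illustration, not the runtime's actual code: the names `cap_to_node` and `caps_on_node` are hypothetical, and treating a node count of 0 as a single-node system is an assumption made here to keep the helper total.

```c
#include <stdint.h>

/* Hypothetical helper, not the RTS's own code: map capability number
 * `cap` to a NUMA node using the round-robin rule "cap C is on node C%N".
 * `n_nodes` is N, the node count reported by the OS. As an assumption of
 * this sketch, n_nodes == 0 is treated as a single-node system. */
static inline uint32_t cap_to_node(uint32_t cap, uint32_t n_nodes)
{
    return n_nodes == 0 ? 0 : cap % n_nodes;
}

/* Count how many of the first `n_caps` capabilities land on `node`. */
static inline uint32_t caps_on_node(uint32_t n_caps, uint32_t n_nodes,
                                    uint32_t node)
{
    uint32_t count = 0;
    for (uint32_t cap = 0; cap < n_caps; cap++) {
        if (cap_to_node(cap, n_nodes) == node) {
            count++;
        }
    }
    return count;
}
```

Under this rule, and assuming a machine that reports two NUMA nodes, running with -N24 places the even-numbered capabilities on node 0 and the odd-numbered ones on node 1, twelve on each.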
    
    For example, using nofib/parallel/queens on a 24-core 2-socket machine:
    
    ```
    $ ./Main 15 +RTS -N24 -s -A64m
      Total   time  173.960s  (  7.467s elapsed)
    
    $ ./Main 15 +RTS -N24 -s -A64m --numa
      Total   time  150.836s  (  6.423s elapsed)
    ```
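
Expressed as relative reductions, the two runs above differ by roughly 13% in total CPU time and roughly 14% in elapsed time. The arithmetic is sketched below; `relative_reduction` is a hypothetical helper name used only for this illustration.

```c
/* Fractional reduction going from `before` to `after`;
 * for example, 0.25 means `after` is 25% smaller than `before`. */
static inline double relative_reduction(double before, double after)
{
    return (before - after) / before;
}
```

Calling `relative_reduction(173.960, 150.836)` gives the total-time figure, and `relative_reduction(7.467, 6.423)` gives the elapsed-time figure.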
    
    The biggest win here is expected to be allocating from node-local
    memory, so th...