setThreadAffinity assumes a certain CPU virtual core layout
The RTS -qa option, which sets thread affinity, was implemented in https://git.haskell.org/ghc.git/commitdiff/31caec794c3978d55d79f715f21fb72948c9f300
// Schedules the thread to run on CPU n of m.  m may be less than the
// number of physical CPUs, in which case, the thread will be allowed
// to run on CPU n, n+m, n+2m etc.
void setThreadAffinity (nat n, nat m)
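To make the "n, n+m, n+2m" rule concrete, here is a minimal sketch of which CPU indices a given call would allow. It is not the RTS code: the function name `affinity_cpus`, the `ncpus` parameter, and the output buffer are assumptions for illustration, and indices are zero-based.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the RTS): writes the CPU indices
   n, n+m, n+2m, ... that are below ncpus into out[] and returns how
   many were written (never more than cap). Indices are zero-based. */
size_t affinity_cpus(unsigned n, unsigned m, unsigned ncpus,
                     unsigned *out, size_t cap)
{
    assert(m > 0 && n < m);
    size_t count = 0;
    for (unsigned cpu = n; cpu < ncpus && count < cap; cpu += m) {
        out[count++] = cpu;
    }
    return count;
}
```

With 8 vCPUs and m = 4, capability 0 is allowed on CPUs 0 and 4, capability 1 on CPUs 1 and 5, and so on.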
Today I discovered that on some machines, this option helps parallel performance (e.g. +RTS -N4) a lot, while on others it doesn't.
Together with thomie on #ghc, I found out the reason:
Let's assume I have 4 real cores with hyperthreading, so 8 virtual cores.
The mapping of hyperthreading cores to physical cores is different across machines.
On one of my machines (Intel i5), the layout is 11223344, meaning that the first two vCPUs (hyperthreads) that the OS announces (visible e.g. in htop) map to the first physical core in the system, and so on.
On my other machine (Intel Xeon), the layout is 12341234; here the 1st and the 5th vCPU map to the same physical core.
On Linux, this layout can be observed by running:
cat /proc/cpuinfo|egrep "processor|physical id|core id" |sed 's/^processor/\nprocessor/g'
I do not know whether this layout is dictated by the processor, chosen by the OS, or even changing across reboots; what is clear is that the layout can vary across machines.
Now, as explained by thomie:
-qa will set your 4 capabilities to cores [(1,5), (2,6), (3,7), (4,8)], and then the os randomly chooses out of those tuples
This strategy is optimal for the 12341234 layout; for example, when running with -N4, it ensures that two threads are not scheduled onto vCPUs that are on the same physical core. The possible +RTS -qa choice 1__4_23_ is a great assignment in this case, as is 1234____ (_ means the vCPU is not chosen).
But for the 11223344 layout, the choice 1234____ isn't good, because it uses only 2 of our 4 physical cores; our program now takes twice as long to run.
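The difference between the two layouts can be checked mechanically. The sketch below counts how many distinct physical cores a given choice of vCPUs lands on; the function name and the `core_of` array (physical core id per vCPU, zero-based vCPU index) are illustrative assumptions, not anything taken from the RTS.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: core_of[v] is the physical core id of vCPU v,
   chosen[] lists the vCPU indices the capabilities ended up on.
   Returns the number of distinct physical cores among them. */
size_t distinct_physical_cores(const int *core_of, size_t nvcpus,
                               const size_t *chosen, size_t nchosen)
{
    size_t distinct = 0;
    for (size_t i = 0; i < nchosen; i++) {
        assert(chosen[i] < nvcpus);
        int seen = 0;
        /* Has an earlier chosen vCPU already used this physical core? */
        for (size_t j = 0; j < i; j++) {
            if (core_of[chosen[j]] == core_of[chosen[i]]) {
                seen = 1;
                break;
            }
        }
        if (!seen) {
            distinct++;
        }
    }
    return distinct;
}
```

For the choice 1234____ (vCPUs 0 to 3), this gives 4 physical cores with `core_of = {1,2,3,4,1,2,3,4}` but only 2 with `core_of = {1,1,2,2,3,3,4,4}`.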
It seems likely to me that setThreadAffinity was written on a machine with the 12341234 layout, and with the assumption that all machines have this layout.
It would be great if we could change it to take the actual layout into account.
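One possible direction, purely as a sketch and not a proposed patch: if the RTS knew the physical core id of every vCPU, it could pin capability n to all vCPUs of the (n mod P)-th physical core, where P is the number of distinct physical cores. Everything below is hypothetical, including the function names and the `core_of` array, and it assumes core ids are unique across sockets (on multi-socket machines "core id" alone is not, so it would have to be combined with "physical id").

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if no vCPU before v shares v's physical core. */
static int is_first_on_core(const int *core_of, size_t v)
{
    for (size_t u = 0; u < v; u++) {
        if (core_of[u] == core_of[v]) {
            return 0;
        }
    }
    return 1;
}

/* Hypothetical layout-aware variant: writes into out[] the vCPUs that
   belong to the (n mod P)-th distinct physical core, counted in order
   of first appearance, and returns how many were written (<= cap). */
size_t layout_aware_cpus(const int *core_of, size_t nvcpus, size_t n,
                         size_t *out, size_t cap)
{
    assert(nvcpus > 0);

    /* P: number of distinct physical cores. */
    size_t ncores = 0;
    for (size_t v = 0; v < nvcpus; v++) {
        ncores += (size_t)is_first_on_core(core_of, v);
    }

    /* Find the core id of the target physical core. */
    size_t target = n % ncores;
    int core_id = core_of[0];
    size_t seen = 0;
    for (size_t v = 0; v < nvcpus; v++) {
        if (is_first_on_core(core_of, v)) {
            if (seen == target) {
                core_id = core_of[v];
                break;
            }
            seen++;
        }
    }

    /* Collect every vCPU that sits on that physical core. */
    size_t count = 0;
    for (size_t v = 0; v < nvcpus && count < cap; v++) {
        if (core_of[v] == core_id) {
            out[count++] = v;
        }
    }
    return count;
}
```

With this scheme, capability 0 would get vCPUs 0 and 1 on the 11223344 layout and vCPUs 0 and 4 on the 12341234 layout, so with -N4 no two capabilities share a physical core on either machine.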
CC: nh2, simonmar, thomie