This is an issue that came to light when testing the patches on #910 (closed); you can see some of the numbers there. In short, with recent GHC HEAD builds, running a large number of threads blocked on MVars burns a lot of system time.
The attached file provides a mediocre reproducer. With it, building with HEAD as of Aug 31st and running with +RTS -N32 results in around 200ms of system time, whereas GHC 7.6.3 spends about 30ms. This shows the disparity, but the result itself is not that egregious.
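As a rough sketch of this kind of reproducer (the thread count and file name here are made up for illustration; the attached file may differ in detail), the essential ingredient is just forking many threads that block on MVars which are never filled:

    import Control.Concurrent
    import Control.Monad

    main :: IO ()
    main = do
      -- Fork a pile of threads that all block on MVars that are never filled.
      mvars <- replicateM 1000 newEmptyMVar
      mapM_ (\mv -> forkIO (takeMVar mv)) mvars
      -- Give them time to park on the MVars, then exit.
      threadDelay 100000

Something along these lines would be built with ghc -O -threaded excessive_system.hs and run with ./excessive_system +RTS -N32 -s; the "sys" figure from time (or the -s statistics) is what is being compared here.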
A more noticeable example is on ticket #910 (closed): running with 31 threads there gives 8 minutes of system time for 17 minutes of user time, yet with one thread the system time drops to under two seconds!
1 thread:
    real    1m20.028s
    user    1m17.921s
    sys     0m1.768s

31 threads:
    real    1m27.445s
    user    17m0.314s
    sys     8m0.175s
It needs to be determined whether this system time is a result of the parallel compilation patches on #910 (closed), or whether it is a new problem in the runtime system, in particular in the parallel IO manager. I am inclined to believe that compiling in parallel involves extra IO (repeatedly reading interface files?), but not eight minutes of it!
I also see the 200ms of system time for excessive_system.hs when I run it with -N32 (compiled with recent HEAD). Below are the first few entries after running perf record; it looks like it may be GC related. I don't see any IO-manager-related entries in the list.
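For anyone repeating this, the invocation is presumably along the lines of perf record ./excessive_system +RTS -N32, followed by perf report to list the hottest entries.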
I haven't tested this (I don't have a 32-core machine), but if gcWorkerThread is causing problems, what does it look like when you turn off the multicore collector and the load balancer? i.e. run with:
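Presumably the flags meant here are -qg (run the GC on a single thread) and -qb (disable parallel GC load balancing), i.e. something like ./excessive_system +RTS -N32 -qg -qb.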
Even with MVars removed and each thread simply waiting (see attached program), it takes 200ms of system time with -N32. Here is a perf recording with call-stack info; here is most of the first entry:
With an (essentially) empty program I get 200ms in kernel when run with -N32. I'm not sure how significant this 200ms is - I guess it is time spent initializing or cleaning up capabilities (possibly including IO manager resources).
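Such a program is presumably nothing more than the following, compiled with -threaded:

    -- Does no work at all; any system time is RTS startup/shutdown overhead.
    main :: IO ()
    main = return ()

and run with +RTS -N32 (optionally adding -s for the GC/task statistics).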
Right, the 200ms isn't really bad. Can anyone reproduce the 8 *minutes* of system time I was seeing on the "ghc-parmake-gsoc" branch? I'd like to figure out whether that problem has anything to do with the patches on #910 (closed), or is unrelated.
Was there a switch from some other kind of lock to a spin lock at some point?
200ms spent doing unspecified "initialization" should definitely be investigated; that's a lot. I can't see anything useful in that call stack - it looks like it might be related to signals or epoll(), but it's hard to tell anything else.
One reason for extra user/system time that could account for your results in #910 (closed) is that if there isn't enough parallelism during GC, you'll have 30 threads spinning looking for work and doing the occasional sched_yield(). The spinning accounts for the user time; the sched_yield() accounts for the system time. Still, it does look like a lot, so it would be good to look at some ThreadScope traces.
That doesn't account for the 200ms in the empty program though, so there could be something else going on here.
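(For reference: a ThreadScope trace just needs the program compiled with -threaded -eventlog and run with +RTS -N32 -l, which produces a .eventlog file that ThreadScope can open.)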
I'm not sure about the 8 minutes problem, but the 200ms system time for the empty program does indeed seem to be related to the parallel IO manager changes. I just modified the IO manager to start only one epoll instance, and with that I get about 50ms of system time, whereas with HEAD I still get about 200ms. I will upload the patch that makes the IO manager use just one poller in case anyone else wants to try it.
         543,096 bytes allocated in the heap
              64 bytes copied during GC
         412,256 bytes maximum residency (1 sample(s))
         279,968 bytes maximum slop
              18 MB total memory in use (0 MB lost due to fragmentation)

                                      Tot time (elapsed)  Avg pause  Max pause
    Gen  0         0 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
    Gen  1         1 colls,     0 par    0.00s    0.00s     0.0009s    0.0009s

    Parallel GC work balance: -nan% (serial 0%, perfect 100%)

    TASKS: 65 (1 bound, 64 peak workers (64 total), using -N32)

    SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

    INIT    time    0.04s  (  0.03s elapsed)
    MUT     time    0.00s  ( -0.00s elapsed)
    GC      time    0.00s  (  0.00s elapsed)
    EXIT    time    0.01s  (  0.00s elapsed)
    Total   time    0.05s  (  0.04s elapsed)

    Alloc rate    0 bytes per MUT second

    Productivity  17.4% of total user, 24.7% of total elapsed

    gc_alloc_block_sync: 0
    whitehole_spin: 0
    gen[0].sync: 0
    gen[1].sync: 0