Excessive system time -- new IO manager problem?

changed milestone to %7.8.3

changed weight to 7

Attached file excessive_system.hs ($900).

Simple program with many MVar-blocked threads.

I also see the 200ms system time for excessive_system.hs when I run it with -N32 (and compiled with recent HEAD). Below is the first few entries after running perf record; it looks like it may be GC related. I don't see any IO manager related entries in the list.

39.02% excessive_syste excessive_system [.] gcWorkerThread 15.53% excessive_syste [kernel.kallsyms] [k] __ticket_spin_lock 6.10% excessive_syste [kernel.kallsyms] [k] default_send_IPI_mask_sequence_phys 5.88% excessive_syste excessive_system [.] allocBlock_sync 3.95% excessive_syste [kernel.kallsyms] [k] update_sd_lb_stats 2.76% excessive_syste excessive_system [.] steal_todo_block 2.18% excessive_syste [kernel.kallsyms] [k] native_write_msr_safe 1.65% excessive_syste libc-2.15.so [.] 0x7e21f 1.52% excessive_syste excessive_system [.] evacuate 0.75% excessive_syste [kernel.kallsyms] [k] dequeue_task_fair 0.74% excessive_syste [kernel.kallsyms] [k] do_mmap_pgoff 0.67% excessive_syste libc-2.15.so [.] __clone 0.67% excessive_syste [kernel.kallsyms] [k] memset 0.66% excessive_syste [kernel.kallsyms] [k] dequeue_entity 0.66% excessive_syste [kernel.kallsyms] [k] release_pages 0.62% excessive_syste [kernel.kallsyms] [k] find_next_bit 0.58% excessive_syste excessive_system [.] scavenge_until_all_done 0.56% excessive_syste [kernel.kallsyms] [k] __schedule 0.54% excessive_syste [kernel.kallsyms] [k] hrtimer_interrupt

I haven't tested this (don't have a 32 core machine,) but if the gcWorkerThread is causing problems, what does it look like when you turn off the multicore collector and load balancer? i.e. run with:

$ ./excessive_system +RTS -N -qb -qg

Good idea. I still see about the same amount of system time. The __ticket_spin_lock is still at the top:

 Events: 1K cycles
#
# Overhead          Command       Shared Object                               Symbol
# ........  ...............  ..................  ...................................
#
    34.41%  excessive_syste  [kernel.kallsyms]   [k] __ticket_spin_lock
    11.94%  excessive_syste  [kernel.kallsyms]   [k] default_send_IPI_mask_sequence_phys
     4.48%  excessive_syste  [kernel.kallsyms]   [k] update_sd_lb_stats
     3.39%  excessive_syste  [kernel.kallsyms]   [k] native_write_msr_safe
     1.86%  excessive_syste  [kernel.kallsyms]   [k] __slab_free
     1.85%  excessive_syste  [kernel.kallsyms]   [k] idle_cpu
     1.55%  excessive_syste  libc-2.15.so        [.] 0x7e175         
     1.39%  excessive_syste  [kernel.kallsyms]   [k] perf_event_alloc
     1.36%  excessive_syste  [kernel.kallsyms]   [k] dequeue_task_fair
     1.21%           :47802  [kernel.kallsyms]   [k] page_remove_rmap
     1.09%  excessive_syste  [kernel.kallsyms]   [k] mmap_region
     0.99%  excessive_syste  [kernel.kallsyms]   [k] __init_waitqueue_head
     0.94%  excessive_syste  [kernel.kallsyms]   [k] memset
     0.92%  excessive_syste  [kernel.kallsyms]   [k] do_futex
     0.90%  excessive_syste  [kernel.kallsyms]   [k] perf_event_mmap_ctx
     0.89%  excessive_syste  [kernel.kallsyms]   [k] kmem_cache_alloc_trace
     0.87%  excessive_syste  libpthread-2.15.so  [.] __deallocate_stack
     0.86%  excessive_syste  excessive_system    [.] waitCondition
     0.86%  excessive_syste  [kernel.kallsyms]   [k] __rwsem_do_wake
     0.85%  excessive_syste  [kernel.kallsyms]   [k] get_page_from_freelist
     0.84%  excessive_syste  [kernel.kallsyms]   [k] find_vma
     0.84%  excessive_syste  excessive_system    [.] s6t3_info
     0.83%  excessive_syste  [kernel.kallsyms]   [k] sys_read
     0.82%  excessive_syste  [kernel.kallsyms]   [k] kfree
     0.78%  excessive_syste  [kernel.kallsyms]   [k] __raw_write_unlock_irq.constprop.31

Even with MVars removed and each thread simply waiting (see attached program) it takes 200ms system time with -N32. Here is a perf recording with call stack info; here is most of the first entry:

# Events: 1K cycles
#
# Overhead          Command       Shared Object                               Symbol
# ........  ...............  ..................  ...................................
#
    51.37%  excessive_syste  [kernel.kallsyms]   [k] __ticket_spin_lock
            |
            --- __ticket_spin_lock
               |          
               |--88.44%-- _raw_spin_lock_irq
               |          |          
               |          |--89.17%-- get_signal_to_deliver
               |          |          do_signal
               |          |          do_notify_resume
               |          |          int_signal
               |          |          |          
               |          |          |--94.68%-- __lll_lock_wait_private
               |          |          |          
               |          |           --5.32%-- epoll_wait
               |          |                     |          
               |          |                     |--67.64%-- 0x7ff06806d533
               |          |                     |          0x8758b4830558b48
               |          |                     |          
               |          |                     |--18.03%-- 0x7ff06806690b
               |          |                     |          0x8758b4830558b48
               |          |                     |          
               |          |                      --14.33%-- 0x7ff068062b2b
               |          |                                0x8758b4830558b48
               |          |          
               |          |--8.08%-- acct_collect
               |          |          do_exit
               |          |          do_group_exit
               |          |          get_signal_to_deliver
               |          |          do_signal
               |          |          do_notify_resume
               |          |          int_signal
               |          |          __lll_lock_wait_private
               |          |

Attached file excessive_system_2.hs ($901).

Simplified version

Actually, I can simplify the program to just

main = return ()

and I get 200ms in kernel when run with -N32. I'm not sure how significant this 200ms time is - I guess it is time spent initializing or cleaning up capabilities (possibly including IO manager resources).

Thanks for the perf data!

Right, the 200ms isn't really bad. Can anyone reproduce the 8 *minutes* of system time I was seeing on the "ghc-parmake-gsoc" branch? I'd like to figure out whether that problem has anything to do with the patches on #910 (closed), or is unrelated.

Was there a switch from some other kind of lock to a spin lock at some point?

200ms spent doing unspecified "initialization" should definitely be investigated, that's a lot. I can't see anything useful in that call stack, it looks like it might be related to signals or epoll() but it's hard to tell anything else.

What does the +RTS -s look like?

One reason for extra user/system time that could account for your results in #910 (closed) is that if there isn't enough parallelism during GC then you'll have 30 threads spinning looking for work, and doing the occasional sched_yield(). Spinning accounts for the user time, the sched_yield() accounts for the system time. Still, it does look like a lot, so it would be good to look at some !ThreadScope traces.

That doesn't account for the 200ms in the empty program though, so there could be something else going on here.

Trac metadata

Trac field	Value
CC	andreas.voellmy@gmail.com, rrnewton@gmail.com → alpmestan@gmail.com, andreas.voellmy@gmail.com, rrnewton@gmail.com

Trac metadata

Trac field	Value
CC	alpmestan@gmail.com, andreas.voellmy@gmail.com, rrnewton@gmail.com → alpmestan@gmail.com, andreas.voellmy@gmail.com, kazu@iij.ad.jp, rrnewton@gmail.com, simonmar

I'm not sure about the 8 minutes problem, but the 200ms system time for the empty program does indeed seem to be related to the parallel IO manager changes. I just modified the IO manager to start only one epoll instance and I get about 50 ms system time, whereas with HEAD I get about 200ms still. I will upload the patch to make the IO manager use just one poller in case anyone else wants to try it.

Attached file one-poller.patch (download).

Hack to use just one poller in the IO manager.

Replying to [ticket:8224#comment:75150 simonmar]:

200ms spent doing unspecified "initialization" should definitely be investigated, that's a lot. I can't see anything useful in that call stack, it looks like it might be related to signals or epoll() but it's hard to tell anything else.

What does the +RTS -s look like?

With my "one poller" hack:

          48,624 bytes allocated in the heap
              40 bytes copied during GC
          38,592 bytes maximum residency (1 sample(s))
         268,608 bytes maximum slop
              32 MB total memory in use (15 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.00s    0.00s     0.0003s    0.0003s

  Parallel GC work balance: -nan (0 / 0, ideal 32)

                        MUT time (elapsed)       GC time  (elapsed)
  Task  0 (worker) :    0.00s    (  0.00s)       0.00s    (  0.00s)
  Task  1 (worker) :    0.00s    (  0.00s)       0.00s    (  0.00s)
  Task  2 (bound)  :    0.00s    (  0.00s)       0.00s    (  0.00s)
  Task  3 (worker) :    0.01s    (  0.01s)       0.00s    (  0.00s)
  Task  4 (worker) :    0.02s    (  0.01s)       0.00s    (  0.00s)
  Task  5 (worker) :    0.01s    (  0.01s)       0.00s    (  0.00s)
  Task  6 (worker) :    0.01s    (  0.01s)       0.00s    (  0.00s)
  Task  7 (worker) :    0.01s    (  0.01s)       0.00s    (  0.00s)
  Task  8 (worker) :    0.02s    (  0.01s)       0.00s    (  0.00s)
  Task  9 (worker) :    0.01s    (  0.01s)       0.00s    (  0.00s)
  Task 10 (worker) :    0.02s    (  0.01s)       0.00s    (  0.00s)
  Task 11 (worker) :    0.01s    (  0.01s)       0.00s    (  0.00s)
  Task 12 (worker) :    0.02s    (  0.01s)       0.00s    (  0.00s)
  Task 13 (worker) :    0.01s    (  0.01s)       0.00s    (  0.00s)
  Task 14 (worker) :    0.01s    (  0.01s)       0.00s    (  0.00s)
  Task 15 (worker) :    0.01s    (  0.01s)       0.00s    (  0.00s)
  Task 16 (worker) :    0.01s    (  0.01s)       0.00s    (  0.00s)
  Task 17 (worker) :    0.01s    (  0.01s)       0.00s    (  0.00s)
  Task 18 (worker) :    0.01s    (  0.01s)       0.00s    (  0.00s)
  Task 19 (worker) :    0.01s    (  0.01s)       0.00s    (  0.00s)
  Task 20 (worker) :    0.01s    (  0.01s)       0.00s    (  0.00s)
  Task 21 (worker) :    0.01s    (  0.01s)       0.00s    (  0.00s)
  Task 22 (worker) :    0.01s    (  0.01s)       0.00s    (  0.00s)
  Task 23 (worker) :    0.01s    (  0.01s)       0.00s    (  0.00s)
  Task 24 (worker) :    0.00s    (  0.01s)       0.00s    (  0.00s)
  Task 25 (worker) :    0.00s    (  0.01s)       0.00s    (  0.00s)
  Task 26 (worker) :    0.00s    (  0.01s)       0.00s    (  0.00s)
  Task 27 (worker) :    0.00s    (  0.01s)       0.00s    (  0.00s)
  Task 28 (worker) :    0.01s    (  0.01s)       0.00s    (  0.00s)
  Task 29 (worker) :    0.01s    (  0.01s)       0.00s    (  0.00s)
  Task 30 (worker) :    0.01s    (  0.01s)       0.00s    (  0.00s)
  Task 31 (worker) :    0.01s    (  0.01s)       0.00s    (  0.00s)
  Task 32 (worker) :    0.01s    (  0.01s)       0.00s    (  0.00s)
  Task 33 (worker) :    0.01s    (  0.01s)       0.00s    (  0.00s)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.00s  (  0.01s elapsed)
  MUT     time    0.00s  (  0.00s elapsed)
  GC      time    0.00s  (  0.00s elapsed)
  EXIT    time    0.01s  (  0.00s elapsed)
  Total   time    0.02s  (  0.01s elapsed)

  Alloc rate    2,431,200 bytes per MUT second

  Productivity  80.0% of total user, 132.3% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

With HEAD:

         543,096 bytes allocated in the heap
              64 bytes copied during GC
         412,256 bytes maximum residency (1 sample(s))
         279,968 bytes maximum slop
              18 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.00s    0.00s     0.0009s    0.0009s

  Parallel GC work balance: -nan% (serial 0%, perfect 100%)

  TASKS: 65 (1 bound, 64 peak workers (64 total), using -N32)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.04s  (  0.03s elapsed)
  MUT     time    0.00s  ( -0.00s elapsed)
  GC      time    0.00s  (  0.00s elapsed)
  EXIT    time    0.01s  (  0.00s elapsed)
  Total   time    0.05s  (  0.04s elapsed)

  Alloc rate    0 bytes per MUT second

  Productivity  17.4% of total user, 24.7% of total elapsed

gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0

I did some more experiments, and I now think it has to do with making "safe" foreign calls from all of the 32 threads. I modified the IO manager so that all of the polling threads (and their epoll backends) are created. Only one of the them is used (on cap 0) is used to register files. I then modified the poll loops so that only the one on cap 0 is running the normal polling loop and the others simply return right away. With this setup, the performance is good (about 40-50 ms for -N32). However, if I have all poll loops (except the one normal poll loop for cap 0) simply make a "safe" call to usleep(10) then the performance goes back to about 200ms. I'll post a patch here so that others can see.

So it looks to be related to the cost of doing safe foreign calls. With the parallel IO manager each of the N pollers will make a safe foreign call when there is no work to do. I'm not sure that this will have much of an impact on most programs, but for the empty program it represents a large part of the work.

I'm also not sure this is related to the 8 minute system time mentioned in the original report. rnewton, can you provide a recipe for how to recreate this?

Is the 200 ms startup time for 32 cores an issue? At 16 cores it is about 50-60ms and at 8 cores about 10-20ms.

One possible resolution (besides doing nothing or making safe foreign calls cheaper) would be to provide an RTS option to use a specific number of IO managers, or at least to differentiate between using 1 or using N.

Attached file usleep.patch (download).

Cap 0 poll loop is normal, all other poll loops call usleep(10) then exit.

The first safe call on each Capability will cause the creation of a new OS thread, which probably accounts for some of the cost. From the strace profile it looks like there is some contention too. It might be useful to do some mutex profiling to see what lock is being contended for - perhaps it is all_tasks_mutex.

I ran with mutrace and did see any obvious explanations.

Is it possible that by having more safe foreign calls RTS threads spend more time in some spin locks? If so, which spin locks would those be?

One interesting thing I just noticed is that taskset makes a big difference. excessive_system_2_HEAD is the empty program compiled with HEAD and excessive_system_2_ONEACTIVELOOP is the empty program compiled with only one IO manager loop being started. We see that for excessive_system_2_HEAD if we use taskset to restrict the threads to run on CPUs 0-9 (all on one NUMA node) we get a big difference:

$ time ./excessive_system_2_HEAD +RTS -N32 

real	0m0.046s
user	0m0.020s
sys	0m0.188s

$ time taskset -c 0-9 ./excessive_system_2_HEAD +RTS -N32 

real	0m0.025s
user	0m0.000s
sys	0m0.056s

excessive_system_2_ONEACTIVELOOP also performs better with taskset but the difference is much smaller. Without taskset I see system tim from 30ms-60ms and with taskset it is about 10-20ms.

So maybe the explanation is that with 31 extra safe foreign calls (from calls to epoll_wait) we get about 31 extra OS threads and if we are unlucky and the OS schedules these on different NUMA nodes we will get cache issues slowing down. And finally a spin lock burns longer accounting for more system time?

Here is a run of excessive_system_2_HEAD using taskset to restrict to first NUMA node or no restrictions with perf stat:

$ perf_3.2.0-25 stat taskset -c 0-9 ./excessive_system_2_HEAD +RTS -N32 

 Performance counter stats for 'taskset -c 0-9 ./excessive_system_2_HEAD +RTS -N32':

         25.019318 task-clock                #    0.900 CPUs utilized          
               584 context-switches          #    0.023 M/sec                  
               218 CPU-migrations            #    0.009 M/sec                  
             1,248 page-faults               #    0.050 M/sec                  
        36,387,323 cycles                    #    1.454 GHz                     [62.17%]
        26,057,951 stalled-cycles-frontend   #   71.61% frontend cycles idle    [98.77%]
        17,556,988 stalled-cycles-backend    #   48.25% backend  cycles idle   
        22,358,612 instructions              #    0.61  insns per cycle        
                                             #    1.17  stalled cycles per insn
         4,300,427 branches                  #  171.884 M/sec                  
           169,950 branch-misses             #    3.95% of all branches         [42.87%]

       0.027811151 seconds time elapsed

$ perf_3.2.0-25 stat taskset -c 0-79 ./excessive_system_2_HEAD +RTS -N32 

 Performance counter stats for 'taskset -c 0-79 ./excessive_system_2_HEAD +RTS -N32':

        157.684408 task-clock                #    3.055 CPUs utilized          
               588 context-switches          #    0.004 M/sec                  
               147 CPU-migrations            #    0.001 M/sec                  
             1,245 page-faults               #    0.008 M/sec                  
       267,826,389 cycles                    #    1.698 GHz                     [54.94%]
       227,802,954 stalled-cycles-frontend   #   85.06% frontend cycles idle    [66.06%]
       184,677,238 stalled-cycles-backend    #   68.95% backend  cycles idle    [86.81%]
        70,842,584 instructions              #    0.26  insns per cycle        
                                             #    3.22  stalled cycles per insn [98.87%]
        16,579,295 branches                  #  105.142 M/sec                   [96.70%]
           233,921 branch-misses             #    1.41% of all branches         [97.31%]

       0.051618204 seconds time elapsed

It is interesting that without restriction we get many more instructions executed and also fewer instructions per cycle.

Also interesting, here are the top few entries perf recording with taskset 0-9

 5.64%  excessive_syste  [kernel.kallsyms]            [k] update_sd_lb_stats
  5.47%  excessive_syste  [kernel.kallsyms]            [k] resched_task
  4.97%  excessive_syste  excessive_system_2_HEAD      [.] evacuate1
  4.84%  excessive_syste  [kernel.kallsyms]            [k] __ticket_spin_lock
  4.18%  excessive_syste  libc-2.15.so                 [.] _int_malloc
  3.85%  excessive_syste  [kernel.kallsyms]            [k] __init_waitqueue_head
  3.81%  excessive_syste  [kernel.kallsyms]            [k] dequeue_entity
  3.48%  excessive_syste  [kernel.kallsyms]            [k] find_vma
  3.14%  excessive_syste  [kernel.kallsyms]            [k] native_write_msr_safe
  3.12%  excessive_syste  [kernel.kallsyms]            [k] source_load
  3.12%  excessive_syste  [kernel.kallsyms]            [k] idle_cpu
  2.62%  excessive_syste  [kernel.kallsyms]            [k] update_cfs_rq_blocked_load
  2.33%  excessive_syste  [kernel.kallsyms]            [k] perf_event_mmap_ctx
  2.13%  excessive_syste  [kernel.kallsyms]            [k] kmem_cache_alloc
  2.07%  excessive_syste  excessive_system_2_HEAD      [.] schedule

and with no restriction

 43.82%  excessive_syste  [kernel.kallsyms]        [k] __ticket_spin_lock
  6.44%  excessive_syste  [kernel.kallsyms]        [k] default_send_IPI_mask_sequence_phys
  4.68%  excessive_syste  [kernel.kallsyms]        [k] native_write_msr_safe
  3.45%  excessive_syste  [kernel.kallsyms]        [k] update_sd_lb_stats
  3.31%  excessive_syste  libc-2.15.so             [.] new_heap
  2.86%  excessive_syste  [kernel.kallsyms]        [k] idle_cpu
  2.40%  excessive_syste  [kernel.kallsyms]        [k] perf_event_mmap
  1.61%  excessive_syste  [kernel.kallsyms]        [k] __schedule
  1.60%  excessive_syste  excessive_system_2_HEAD  [.] s8Bc_info
  1.56%  excessive_syste  [kernel.kallsyms]        [k] inherit_event.isra.73
  1.50%  excessive_syste  [kernel.kallsyms]        [k] __init_waitqueue_head
  1.49%  excessive_syste  libc-2.15.so             [.] __clone
  1.45%  excessive_syste  [kernel.kallsyms]        [k] clear_page_c
  1.44%  excessive_syste  [kernel.kallsyms]        [k] try_to_wake_up

Pretty clearly spending more time in a spin lock when we run on all NUMA nodes.

changed milestone to %7.10.1

Whatever happens, I doubt this will be fixed in time for 7.8.3, but someone can bump it back if so.

mentioned in issue #9221

removed milestone

(Pushing back to 7.12 as this ticket requires more investigation for a fix.)

mentioned in issue #10728 (closed)

Trac metadata

Trac field	Value
Related	- → #9221

changed milestone to %8.0.1

Milestone renamed

changed milestone to %8.2.1

Sadly this won't be resolved for 8.0.1.

upvote

Trac metadata

Trac field	Value
CC	alpmestan@gmail.com, andreas.voellmy@gmail.com, gidyn, kazu@iij.ad.jp, nh2, rrnewton@gmail.com, simonmar, slyfox → alpmestan@gmail.com, andreas.voellmy@gmail.com, gidyn, hjarvi, kazu@iij.ad.jp, nh2, rrnewton@gmail.com, simonmar, slyfox

Replying to [ticket:8224#comment:126259 hjarvi]:

upvote

Do you have an example of an application which is particularly badly affected by this issue? It would be quite helpful if you could provide a testcase if so.

I tried reproducing this with cf5eec3e on a 2-socket, 24-core machine I have access to. I was unable to see anything like the system time numbers like reported in ticket:8224#comment:76935,

$ cat >T8224.hs <<EOF
main = return ()
EOF
$ inplace/bin/ghc-stage2 -threaded  -rtsopts T8224.hs
$ time ./T8224 +RTS -N32

real    0m0.016s
user    0m0.008s
sys     0m0.012s

and

$ cat >T8224a.hs <<EOF
import Control.Monad
import Control.Concurrent

main = do
  n <- getNumCapabilities
  forM_ [1..n] $ \_ -> forkIO $ return () 
  threadDelay (4 * 1000 * 1000)
EOF
$ inplace/bin/ghc-stage2 -threaded  -rtsopts T8224a.hs
$ time ./T8224a +RTS -N32                                                                                                      

real    0m4.018s
user    0m0.160s
sys     0m0.028s

All-in-all this doesn't seem terribly unreasonable to me.

Hi, as a developer on 48-core machine, I upvote the issue. I can reproduce it with many programs, e.g. this trivial one:

$ cat x.hs
main = getContents >>= print . length
$ ghc -O2 -rtsopts -threaded x
$ time head -c10MB /dev/zero | ./x 
10000000

real    0m0.182s
user    0m0.180s
sys     0m0.012s
$ time head -c10MB /dev/zero | ./x +RTS -N
10000000

real    0m1.782s
user    0m45.183s
sys     0m10.305s

The problem can be tamed with -A option (somewhere around 200M is optimal in my case), but still it's pretty bad:

$ time head -c10MB /dev/zero | ./x +RTS -N -A200M
10000000

real    0m0.712s
user    0m1.144s
sys     0m0.412s

Replying to [ticket:8224#comment:127188 alkar]:

Hi, as a developer on 48-core machine, I upvote the issue. I can reproduce it with many programs, e.g. this trivial one:
$ cat x.hs
main = getContents >>= print . length
$ ghc -O2 -rtsopts -threaded x
$ time head -c10MB /dev/zero | ./x 
10000000

real    0m0.182s
user    0m0.180s
sys     0m0.012s
$ time head -c10MB /dev/zero | ./x +RTS -N
10000000

real    0m1.782s
user    0m45.183s
sys     0m10.305s
The problem can be tamed with -A option (somewhere around 200M is optimal in my case), but still it's pretty bad:
$ time head -c10MB /dev/zero | ./x +RTS -N -A200M
10000000

real    0m0.712s
user    0m1.144s
sys     0m0.412s

Can you share a bit more info on your setup?

GHC version you tried it on
where does +RTS -s claim to attribute the time to
lstopo-no-graphics output
numactl -H output

A few tweaks happened in #9221 since latest GHC-8.0.1 release. It has a few more knobs to tune like -qn<value> (number of GC threads), -n<value> (nursery chunks), -qb0 (work-stealing nursery scan mode).

Attached file numactl.txt (download).

Attached file lstopo.txt (download).

In previous post I used ghc-7.10.1 by mistake. With 8.0.1 (stack ghc -- -O2 -rtsopts -threaded x) single core performance is same (around 0.18s), but with +RTS -N I get better results:

real    0m0.843s
user    0m22.149s
sys     0m2.256s

Even better with -A200M:

real    0m0.793s
user    0m1.200s
sys     0m0.692s

$ head -c10MB /dev/zero | ./x +RTS -s -N > /dev/null
     406,012,416 bytes allocated in the heap
      30,457,408 bytes copied during GC
         674,776 bytes maximum residency (2 sample(s))
         665,248 bytes maximum slop
              27 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       775 colls,   775 par   16.945s   0.429s     0.0006s    0.0302s
  Gen  1         2 colls,     1 par    0.052s   0.002s     0.0012s    0.0013s

  Parallel GC work balance: 0.14% (serial 0%, perfect 100%)

  TASKS: 98 (1 bound, 97 peak workers (97 total), using -N48)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    2.660s  (  0.091s elapsed)
  MUT     time   11.065s  (  0.744s elapsed)
  GC      time   16.997s  (  0.432s elapsed)
  EXIT    time    0.028s  (  0.003s elapsed)
  Total   time   30.750s  (  1.270s elapsed)

  Alloc rate    36,694,519 bytes per MUT second

  Productivity  36.1% of total user, 873.3% of total elapsed

gc_alloc_block_sync: 38774
whitehole_spin: 0
gen[0].sync: 408
gen[1].sync: 130

I attached outputs of lstopo (I don't have -no-graphics variant there) and numactl -H.

In reply to ticket:8224#comment:127189

Hi, as a developer on 48-core machine, I upvote the issue. I can reproduce it with many programs, e.g. this trivial one:

It is far from clear to me that we should expect the program you cited to run well with -N. It allocates heavily and -N will use parallel GC on a heavily imbalanced heap. Looking at the eventlog confirms that nearly all of the cores are simply waiting for GC. As expected disabling parallel GC with +RTS -qg drastically reduces system time due to reduced synchronization overhead.

Replying to [ticket:8224#comment:127191 bgamari]:

As expected disabling parallel GC with +RTS -qg drastically reduces system time due to reduced synchronization overhead.

Wow! Just tried +RTS -N -qg and brings performance back in the sane world! Now it's something like

real    0m0.269s
user    0m0.232s
sys     0m0.104s

That is, 30-40% slower than without -N.

A real-life example of a program where I'd want a long single thread computation with -N turned on is a faster analog of fgrep -f patterns.txt input1 ... inputN:

Read all lines from patterns.txt into a set.
In parallel for each input file, filter (\str -> member str set) . lines

Trac field	Value
Version	7.7
Type	Bug
TypeOfFailure	OtherFailure
Priority	high
Resolution	Unresolved
Component	Runtime System
Test case
Differential revisions
BlockedBy
Related
Blocking
CC	rrnewton@gmail.com
Operating system
Architecture

Excessive system time -- new IO manager problem?

Child items ...

Activity

Excessive system time -- new IO manager problem?

Relates to

Activity