This is an issue that came to light when testing the patches on #910 (closed); you can see some of the numbers there. In short, with recent GHC HEAD builds, running a large number of threads blocked on MVars burns a lot of system time.
The attached file provides a mediocre reproducer. With it, building with HEAD as of Aug 31st and running with +RTS -N32 results in around 200ms of system time, whereas GHC 7.6.3 spends about 30ms. This shows the disparity, but the result itself is not that egregious.
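As a rough sketch of this kind of reproducer (the thread count and file name here are made up for illustration; the attached file may differ in detail), the essential ingredient is just forking many threads that block on MVars which are never filled:

    import Control.Concurrent
    import Control.Monad

    main :: IO ()
    main = do
      -- Fork a pile of threads that all block on MVars that are never filled.
      mvars <- replicateM 1000 newEmptyMVar
      mapM_ (\mv -> forkIO (takeMVar mv)) mvars
      -- Give them time to park on the MVars, then exit.
      threadDelay 100000

Something along these lines would be built with ghc -O -threaded excessive_system.hs and run with ./excessive_system +RTS -N32 -s; the "sys" figure from time (or the -s statistics) is what is being compared here.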
A more noticeable example is on ticket #910 (closed): running with 31 threads there gives 8 minutes of system time for 17 minutes of user time, yet with one thread the system time drops to under two seconds!
1 thread:
    real    1m20.028s
    user    1m17.921s
    sys     0m1.768s

31 threads:
    real    1m27.445s
    user    17m0.314s
    sys     8m0.175s
It needs to be determined whether this system time is a result of the parallel compilation patches on #910 (closed), or whether it is a new problem in the runtime system, in particular in the parallel IO manager. I am inclined to believe that compiling in parallel involves extra IO (repeatedly reading interface files?), but not eight minutes of it!
I also see the 200ms of system time for excessive_system.hs when I run it with -N32 (compiled with recent HEAD). Below are the first few entries after running perf record; it looks like it may be GC related. I don't see any IO-manager-related entries in the list.
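For anyone repeating this, the invocation is presumably along the lines of perf record ./excessive_system +RTS -N32, followed by perf report to list the hottest entries.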
I haven't tested this (I don't have a 32-core machine), but if gcWorkerThread is causing problems, what does it look like when you turn off the multicore collector and the load balancer? i.e. run with:
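Presumably the flags meant here are -qg (run the GC on a single thread) and -qb (disable parallel GC load balancing), i.e. something like ./excessive_system +RTS -N32 -qg -qb.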
Even with MVars removed and each thread simply waiting (see attached program), it takes 200ms of system time with -N32. Here is a perf recording with call-stack info; here is most of the first entry:
With an (essentially) empty program I get 200ms in kernel when run with -N32. I'm not sure how significant this 200ms is - I guess it is time spent initializing or cleaning up capabilities (possibly including IO manager resources).
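Such a program is presumably nothing more than the following, compiled with -threaded:

    -- Does no work at all; any system time is RTS startup/shutdown overhead.
    main :: IO ()
    main = return ()

and run with +RTS -N32 (optionally adding -s for the GC/task statistics).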
Right, the 200ms isn't really bad. Can anyone reproduce the 8 *minutes* of system time I was seeing on the "ghc-parmake-gsoc" branch? I'd like to figure out whether that problem has anything to do with the patches on #910 (closed), or is unrelated.
Was there a switch from some other kind of lock to a spin lock at some point?
200ms spent doing unspecified "initialization" should definitely be investigated; that's a lot. I can't see anything useful in that call stack - it looks like it might be related to signals or epoll(), but it's hard to tell anything else.
One reason for extra user/system time that could account for your results in #910 (closed) is that if there isn't enough parallelism during GC, you'll have 30 threads spinning looking for work and doing the occasional sched_yield(). The spinning accounts for the user time; the sched_yield() accounts for the system time. Still, it does look like a lot, so it would be good to look at some ThreadScope traces.
That doesn't account for the 200ms in the empty program though, so there could be something else going on here.
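(For reference: a ThreadScope trace just needs the program compiled with -threaded -eventlog and run with +RTS -N32 -l, which produces a .eventlog file that ThreadScope can open.)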
I'm not sure about the 8 minutes problem, but the 200ms system time for the empty program does indeed seem to be related to the parallel IO manager changes. I just modified the IO manager to start only one epoll instance, and with that I get about 50ms of system time, whereas with HEAD I still get about 200ms. I will upload the patch that makes the IO manager use just one poller in case anyone else wants to try it.
         543,096 bytes allocated in the heap
              64 bytes copied during GC
         412,256 bytes maximum residency (1 sample(s))
         279,968 bytes maximum slop
              18 MB total memory in use (0 MB lost due to fragmentation)

                                      Tot time (elapsed)  Avg pause  Max pause
    Gen  0         0 colls,     0 par    0.00s    0.00s     0.0000s    0.0000s
    Gen  1         1 colls,     0 par    0.00s    0.00s     0.0009s    0.0009s

    Parallel GC work balance: -nan% (serial 0%, perfect 100%)

    TASKS: 65 (1 bound, 64 peak workers (64 total), using -N32)

    SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

    INIT    time    0.04s  (  0.03s elapsed)
    MUT     time    0.00s  ( -0.00s elapsed)
    GC      time    0.00s  (  0.00s elapsed)
    EXIT    time    0.01s  (  0.00s elapsed)
    Total   time    0.05s  (  0.04s elapsed)

    Alloc rate    0 bytes per MUT second

    Productivity  17.4% of total user, 24.7% of total elapsed

    gc_alloc_block_sync: 0
    whitehole_spin: 0
    gen[0].sync: 0
    gen[1].sync: 0