Statistical CPU profiling in the RTS
This issue is about adding native support to the RTS for dumping execution stacks at a chosen frequency.
Current status
The current best (and only?) way to do statistical CPU profiling in Haskell is by using gdb together with a custom plugin to walk Haskell stacks (see for instance this one). The custom plugin is required because the runtime is non-standard in various ways (among others, #8272). It works like this:
- Attach `gdb` to a running process
- Repeatedly send `SIGPROF` at the desired frequency
- Have the `gdb` plugin find the address of stack pointers by navigating the program state
- Walk the stacks and let `gdb` resolve addresses to symbols
- Process the dump and plot (flamegraph)
This is a bit cumbersome, and also slows down execution of the running program quite a bit, to the extent that higher observation frequencies are impossible.
After reading the very interesting series of blog posts on DWARF by @bgamari & having a chat with him, I understand that various roads are being taken to improve the profiling landscape:
- Making the RTS more "standard": making the Haskell stack walkable like a C stack, storing `Sp` in the standard register (#8272).
- Improving DWARF support, also using a variant of DWARF with one-to-many mappings between an address and the originating lines of code.
Actually, I believe we could reach native stack profiling in GHC even without these milestones:
- Stack pointer. For an "internal" profiler, it does not really matter which register `Sp` is stored in.
- Symbol names. Dumping the stack can be separated from resolving the names of functions; I think these are two orthogonal operations. Resolving names is slower and more complicated than simply dumping the stack contents: it requires the DWARF information (I see `libdw` is being used in the RTS), and DWARF data can be "incomplete" for compiled Haskell programs because of the limits of the DWARF format (see this post). Resolving symbol names could therefore be carried out in a subsequent step, after collection.
Proposed approach
I propose to take inspiration from other runtime systems:
- Rely on kernel timer signals
- Send signal to all Haskell threads (or, to a random one, cf. "statistical")
- The kernel interrupts the thread and stores the current state (including registers) in a `ucontext_t` struct
- The signal handler obtains `Sp` from the `ucontext_t` and dumps the stack in the EventLog
- The handler returns and execution continues as usual
An advantage of this approach is that it doesn't require the RTS scheduler to perform context switches in order to observe the stack. By contrast (if I understand correctly), the current profiling setup in the RTS forces threads to yield (by setting `HpLim` to 0).
Another approach could be to force threads to yield, and use the `Sp` saved in the TSO. One issue would be performance: yielding at high observation frequencies would probably disrupt it. But more importantly, we wouldn't be able to profile, for instance, time spent in tight loops that don't allocate, since threads can only yield when trying to allocate.
The approach above relies on the following assumptions:
- There is a way to tell whether the interrupted thread was indeed executing STG code (i.e. whether the expected register indeed contains `Sp`).
- Haskell stacks are always consistent, i.e. `Sp` is incremented only after a frame is fully allocated. I have the impression that this is not always the case, but maybe somebody more expert can advise.
We should also make the RTS dump at startup enough data to map from memory addresses to offsets in the ELF binary, so that symbol names can be resolved in post-processing.
(related to #10915)