Statistical CPU profiling in the RTS
This issue is about adding native support to the RTS for dumping execution stacks at a chosen sampling frequency.
Current status
The current best (and only?) way to do statistical CPU profiling in Haskell is by using `gdb` together with a custom plugin to walk Haskell stacks (see for instance this one). The custom plugin is required because the runtime is non-standard in various ways (among others, #8272). It works like this:
- Attach `gdb` to a running process
- Repeatedly send `SIGPROF` at the desired frequency
- Have the `gdb` plugin find the address of stack pointers by navigating the program state
- Walk the stacks and let `gdb` resolve addresses to symbols
- Process the dump and plot (flamegraph)
This is a bit cumbersome, and it also slows down the running program considerably, to the point that higher observation frequencies are impossible.
After reading the very interesting series of blog posts on DWARF by @bgamari & having a chat with him, I understand that various roads are being taken to improve the profiling landscape:
- Making the RTS more "standard": making the Haskell stack walkable like a C stack, storing `Sp` in the standard register (#8272).
- Improving DWARF support, also using a variant of DWARF with one-to-many mappings between an address and the originating lines of code.
Actually, I believe we could reach native stack profiling in GHC even without these milestones:
- Stack pointer. For an "internal" profiler, it does not really matter which register `Sp` is stored in.
- Symbol names. The action of dumping the stack can be separated from the action of resolving function names; I think these are two orthogonal operations. Resolving names is slower and more complicated than simply dumping the stack contents: it requires DWARF information (I see `libdw` is being used in the RTS), and the DWARF data can be "incomplete" for compiled Haskell programs because of the limits of the DWARF format (see this post). Resolving symbol names could therefore be carried out in a subsequent step, after collection.
Proposed approach
I propose to take inspiration from other runtime systems:
- Rely on kernel timer signals
- Send the signal to all Haskell threads (or to a random one, cf. "statistical")
- The kernel interrupts the thread and stores its current state (including registers) in a `ucontext_t` struct
- The signal handler obtains `Sp` from the `ucontext_t` and dumps the stack to the EventLog
- The handler returns and execution continues as usual
An advantage of this approach is that it doesn't require the RTS scheduler to perform context switches in order to observe the stack. On the contrary (if I understand correctly) the current profiling setup in the RTS forces threads to yield (by setting `HpLim` to 0).
Another approach could be to force threads to yield, and use the `Sp` saved in the TSO. One issue would be performance: switching at high observation frequencies would probably disrupt it. But more importantly, we wouldn't be able to profile, for instance, the time spent in tight loops that don't allocate, since threads can only yield when trying to allocate.
The approach above relies on the following assumptions:
- There is a way to tell whether the interrupted thread was indeed executing STG code (i.e. whether the expected register actually contains `Sp`).
- Haskell stacks are always consistent, i.e. `Sp` is incremented only after a frame is fully allocated. I have the impression that this is not always the case, but maybe somebody more expert can advise.
We should also make the RTS dump, at startup, enough data to map memory addresses to offsets in the ELF binary, so that symbol names can be resolved in post-processing.
(related to #10915)