Statistical CPU profiling in the RTS
This issue is about adding native support to the RTS for dumping execution stacks at a chosen sampling frequency.
Current status
The current best (and only?) way to do statistical CPU profiling in Haskell is by using `gdb` together with a custom plugin to walk Haskell stacks (see for instance this one). The custom plugin is required because the runtime is non-standard in various ways (among others, #8272). It works like this:
- Attach `gdb` to a running process
- Repeatedly send `SIGPROF` at the desired frequency
- Have the `gdb` plugin find the address of stack pointers by navigating the program state
- Walk the stacks and let `gdb` resolve addresses to symbols
- Process the dump and plot (flamegraph)
This is a bit cumbersome, and it also slows down the running program considerably, to the point that higher observation frequencies are impossible.
After reading the very interesting series of blog posts on DWARF by @bgamari & having a chat with him, I understand that various roads are being taken to improve the profiling landscape:
- Making the RTS more "standard": making the Haskell stack walkable like a C stack, storing `Sp` in the standard register (#8272).
- Improving DWARF support, also using a variant of DWARF with one-to-many mappings between an address and the originating lines of code.
Actually, I believe we could reach native stack profiling in GHC even without these milestones:
- Stack pointer. For an "internal" profiler, it does not really matter which register `Sp` is stored in.
- Symbol names. The action of dumping the stack can be separated from the action of resolving function names; I think these are two orthogonal operations. Resolving names is slower and more complicated than simply dumping the stack contents: it requires DWARF information (I see `libdw` is being used in the RTS), and the DWARF data can be "incomplete" for compiled Haskell programs because of the limits of the DWARF format (see this post). Resolving symbol names could therefore be carried out in a subsequent step, after collection.
Proposed approach
I propose to take inspiration from other runtime systems:
- Rely on kernel timer signals
- Send the signal to all Haskell threads (or to a random one, cf. "statistical")
- The kernel interrupts the thread and stores its current state (including registers) in a `ucontext_t` struct
- The signal handler obtains `Sp` from the `ucontext_t` and dumps the stack to the EventLog
- The handler returns and execution continues as usual
An advantage of this approach is that it doesn't require the RTS scheduler to perform context switches in order to observe the stack. On the contrary (if I understand correctly) the current profiling setup in the RTS forces threads to yield (by setting `HpLim` to 0).
Another approach could be to force threads to yield, and use the `Sp` saved in the TSO. One issue would be performance: switching at high observation frequencies would probably disrupt it. But more importantly, we wouldn't be able to profile, for instance, the time spent in tight loops that don't allocate, since threads can only yield when trying to allocate.
The approach above relies on the following assumptions:
- There is a way to tell whether the interrupted thread was indeed executing STG code (i.e. whether the expected register actually contains `Sp`).
- Haskell stacks are always consistent, i.e. `Sp` is incremented only after a frame is fully allocated. I have the impression that this is not always the case, but maybe somebody more expert can advise.
We should also make the RTS dump, at startup, enough data to map memory addresses to offsets in the ELF binary, so that symbol names can be resolved in post-processing.
(related to #10915)