SIGQUIT is intended to mirror the same signal's functionality in the JVM: provide a snapshot of a process's state in the form of backtraces on stderr. However, GHC's implementation currently only prints a backtrace for a single, arbitrary thread (namely the one running on the main capability). This should be fixed.
There is the question of what the precise semantics of SIGQUIT should be. Should we dump the state of all threads, all schedulable threads, or only currently scheduled threads? My feeling is the latter.
This isn't entirely trivial to implement, especially given that we need to ensure that the output is readable (e.g. prevent interleaving of output from different capabilities). One option would be to add a new Message variety which can be used to request a backtrace from a capability. The thread handling the SIGQUIT could then send this message to each capability, wait for their replies, and print the results. This sounds complex for a debugging feature, however.
For my situation (troubleshooting 100%–400% CPU hogging), it would help a lot more to just print the stack traces of the busy capabilities.
But I have no experience with libdw; do you think it is feasible to snapshot all busy threads' frames from the signal-handling thread? I intend to work on such a small improvement if that's possible.
Maybe iterating over more threads can be done here? But it seems our code is responsible for figuring out the list of TIDs, and I have no clue how to obtain the list of busy threads at the moment. Some insights, please?
I looked into libdwfl; it seems we can iterate threads via dwfl_getthreads and obtain frames via dwfl_thread_getframes, then print them as is currently done (rough sketch below).
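For reference, here is a minimal sketch of that iteration, modelled on eu-stack rather than on GHC's own libdw backend (rts/Libdw.c supplies its own Dwfl_Thread_Callbacks via dwfl_attach_state, with set_initial_registers seeding each thread's registers, since a process cannot ptrace its own thread group). It walks all threads of an *external* process given by PID; error handling is mostly omitted:

```c
/* Sketch: print raw PCs for every thread of an external process, eu-stack
 * style.  Build with: gcc sketch.c -ldw -lelf                              */
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <inttypes.h>
#include <sys/types.h>
#include <dwarf.h>               /* DWARF_CB_OK / DWARF_CB_ABORT */
#include <elfutils/libdwfl.h>

static int frame_cb(Dwfl_Frame *state, void *arg)
{
  (void) arg;
  Dwarf_Addr pc;
  bool is_activation;
  if (!dwfl_frame_pc(state, &pc, &is_activation))
    return DWARF_CB_ABORT;
  /* For return addresses (not activations) step back one byte so that a
   * later address->line lookup lands inside the call site.               */
  Dwarf_Addr pc_adjusted = pc - (is_activation ? 0 : 1);
  printf("    0x%" PRIx64 "\n", (uint64_t) pc_adjusted);
  return DWARF_CB_OK;
}

static int thread_cb(Dwfl_Thread *thread, void *arg)
{
  (void) arg;
  printf("TID %ld:\n", (long) dwfl_thread_tid(thread));
  dwfl_thread_getframes(thread, frame_cb, NULL);
  return DWARF_CB_OK;
}

int main(int argc, char **argv)
{
  if (argc < 2) return 1;
  pid_t pid = (pid_t) atoi(argv[1]);

  static char *debuginfo_path;
  static const Dwfl_Callbacks cbs = {
    .find_elf       = dwfl_linux_proc_find_elf,
    .find_debuginfo = dwfl_standard_find_debuginfo,
    .debuginfo_path = &debuginfo_path,
  };

  Dwfl *dwfl = dwfl_begin(&cbs);
  dwfl_linux_proc_report(dwfl, pid);         /* load the process's modules */
  dwfl_report_end(dwfl, NULL, NULL);
  dwfl_linux_proc_attach(dwfl, pid, false);  /* enables thread iteration   */
  dwfl_getthreads(dwfl, thread_cb, NULL);
  dwfl_end(dwfl);
  return 0;
}
```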
I'm wondering whether I need to check that a thread is serving as a capability, or whether that just doesn't matter; if it is needed, how would I do it?
Hi, this is cool! Could you post this as a merge request?
I don't have much insight on libdw, but a few comments:
Your sigRecieverThId logic looks racy. I would use a cas(&sigRecieverThId, 0, currThId).
You don't set sigRecieverThId back to 0 when you're finished. I guess this mostly works, since the same thread is always the propagator, but it would be more robust to set it back to 0 (see the sketch after these comments).
As bgamari mentions in the comments above, your output callstacks can be interleaved. One (disgusting) way to avoid this would be to build a linked list with CAS operations, then have the propagator busy-wait until the list has the right length, then write the stack traces and free the memory. I believe your posted callstacks actually are interleaved; e.g. the yieldCapability and schedule entries at the top of the second stack belong to the first stack.
n_capabilities can increase. It would be safest to take n_caps = RELAXED_LOAD(&n_capabilities) at the start and use n_caps, especially if you end up doing more logic with it.
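To illustrate the claim/release pattern suggested in the first two points, here is a sketch using portable C11 atomics in place of the RTS's cas()/RELAXED_LOAD macros; the 0/1 claim flag and the function names are invented for illustration (the actual MR keys on a thread id):

```c
/* Claim/release sketch for the "first receiver is the propagator" idea. */
#include <stdatomic.h>

static atomic_int sig_propagator_claimed;   /* 0 = unclaimed */

static void on_sigquit(void)
{
  int expected = 0;
  if (atomic_compare_exchange_strong(&sig_propagator_claimed, &expected, 1)) {
    /* We won the race: propagate the signal, gather backtraces, print.   */
    /* ...                                                                */
    /* Release the claim when finished, so the next SIGQUIT still works
     * even if the OS delivers it to a different thread.                  */
    atomic_store(&sig_propagator_claimed, 0);
  } else {
    /* Another thread is already propagating; respond as a follower.      */
  }
}
```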
@duog Thanks for your encouragement. I've further polished the code and opened !4786 (closed); please continue the review there.
I gave up the propagator-thread logic, as I figured out it won't work on POSIX systems other than Linux: the first receiver thread won't always be the sole receiver afterwards. Per the POSIX signal-delivery spec, any thread not blocking SIGQUIT can receive it from the OS, and the alternative of blocking SIGQUIT in all threads except one won't work either, because while the propagated SIGQUIT is blocked we can't be triggered to dump a backtrace.
Setting sigRecieverThId back to 0 is a solution that hadn't come to my mind before I drafted the other solution in that MR; it might be valid too, so let's consider it as well.
The output above is actually truncated from a seemingly infinite loop spitting out the last line. In any case, I later saw all sorts of strange things (partial backtraces and even no backtrace at all, for example) if I didn't rm -rf ~/.cabal/store/ghc-8.10.3/ and cabal clean before cabal run again, after rebuilding GHC with slight source modifications. A fresh cabal run with cleared intermediate artifacts
(update: not stably)
printed much saner output.
RELAXED_LOAD(&n_capabilities) should definitely be incorporated. I also read that fprintf() is not safe to use from a signal handler, so I'll put more thought into how to make it safer; that may even need a separate serializer thread, I feel. I'll experiment a bit and update the MR for more discussion and review.
@duog On second thought about setting sigRecieverThId back to 0: I think it is still racy for other threads to observe it being set or cleared. When another thread responding to a propagated SIGQUIT sees sigRecieverThId as 0, it will wrongly decide to propagate the signal further, causing storms of echoes. I don't think we can safely make the propagator's 0-clearing happen-after all other threads have seen a non-zero value, as those observations happen in signal-handler code, and the usual mutex-like synchronization seems dangerous there. Or do we have such options?
And I remember the original idea was to record the thread that received the very first SIGQUIT sent to the process. Given that I wrongly assumed subsequent SIGQUITs would be delivered solely to that same thread (though that is true on Linux in the usual cases), the expected protocol was that sigRecieverThId would be set once and remain unchanged from then on. I later realized it won't work on other POSIX systems, so I abandoned that solution.
I later find all sort of strange things (partial trace backs and even no trace back at all e.g.) if I don't rm -rf ~/.cabal/store/ghc-8.10.3/ and cabal clean
Yes, it can be tricky working with your own GHCs. cabal has a --store-dir flag which you can use to point it at a separate global store. If you are working with the master branch, then https://ghc.gitlab.haskell.org/head.hackage/ is a collection of patches to make various packages work.
I read fprintf() is not safe to be used from a signal handler,
Me too! I created #19205 for this. My understanding is that it's probably fine if you don't try to print floats. I don't think you should worry about this too much.
I see what you mean about the storm of echoes. Unfortunately I don't think your timing solution solves this either. Processes can have arbitrarily long gaps between time slices; this can manifest on a system under heavy load, when being OOM-killed, when waking from suspend, or if the process is SIGSTOPped.
usual mutex-like sync seems dangerous. Or do we have such options?
We don't really have such options. We can busy-wait though: while (SEQ_CST_LOAD(&still_going)) { busy_wait_nop(); }
may even need a separate serializer thread
This would likely be too high a cost to pay for a niche feature. However, having an existing thread serialise the output may be a solution (bgamari elaborates in a comment above).
and the solution to block SIGQUIT from all threads except one will not work, because when the propagated SIGQUIT is blocked,
This is true, but there is nothing special about SIGQUIT here. We could use SIGQUIT for the leader, then it could send a SIG??? to follower threads. I don't know what a good choice is, and this would have to be documented.
I'm interested in how you intend to deal with interleaved output. I think a good first step is to print the thread id in each frame, so that we can easily see the interleaving. My inclination would be to have the leader busy-wait for each follower to finish, then do all the printing itself, which helps solve the "storm of echoes" too.
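A rough sketch of that leader/follower shape, with C11 atomics standing in for the RTS's SEQ_CST_LOAD/busy_wait_nop; every name here is invented for illustration and the backtrace formatting is hand-waved:

```c
/* Leader/follower sketch: each follower formats its backtrace into its own
 * slot and bumps a counter; the leader busy-waits until everyone is done,
 * then prints all slots itself so output cannot interleave.              */
#include <stdatomic.h>
#include <stdio.h>

#define MAX_FOLLOWERS 64
#define SLOT_BYTES    8192

static char       slots[MAX_FOLLOWERS][SLOT_BYTES]; /* one buffer per follower */
static atomic_int n_done;                            /* followers finished      */

static void follower_report(int my_index)
{
  /* Format this thread's backtrace into its private slot, then signal done. */
  snprintf(slots[my_index], SLOT_BYTES, "backtrace of follower %d ...\n", my_index);
  atomic_fetch_add(&n_done, 1);
}

static void leader_collect(int n_followers)
{
  /* ... trigger the followers here ... */
  while (atomic_load(&n_done) < n_followers)
    ;                                   /* busy_wait_nop() in the RTS        */

  for (int i = 0; i < n_followers; i++)
    fputs(slots[i], stderr);            /* single writer: no interleaving    */

  atomic_store(&n_done, 0);             /* reset for the next SIGQUIT        */
}
```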
I don't think your timing solution solves this either.
It's not perfect, but I think it will at least work in the usual cases, and it won't crash or flood output in extreme situations. Is there some unacceptable worst-case failure that I'm not aware of?
I'm interested in how you intend to deal with interleaved output.
I hesitate to add any form of wait to a signal handler; a more sophisticated solution in my mind would be something like Android's circular log buffer in the kernel, which can be consumed by its logcat utility.
We would then still need a counterpart to logcat; that could be an external process, or an in-process thread, to dump the asynchronous payload to stderr.
A full-featured in-process circular log buffer would be good to have IMHO; the question is how its cost is justified given where we are. I'm glad to work on a basic version if we can agree on an implementation that is simple enough, plus maybe an in-process thread, controlled by an RTS option, to keep dumping the payload to stderr.
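If it helps the discussion, here is a minimal single-producer/single-consumer ring sketch of that idea: one writer (e.g. the signal handler) that never blocks and drops on overflow, and one drainer thread that keeps flushing to stderr. A real multi-capability version would need per-capability rings or an MPSC design; all names here are invented:

```c
/* Minimal SPSC ring buffer for log payloads.                              */
#include <stdatomic.h>
#include <string.h>
#include <unistd.h>

#define RING_SLOTS 64
#define MSG_BYTES  512

static char        ring[RING_SLOTS][MSG_BYTES];
static atomic_uint head;   /* next slot to write (monotonic)  */
static atomic_uint tail;   /* next slot to read  (monotonic)  */

/* Producer side: lock-free, never blocks, drops the message when full.   */
static int ring_push(const char *msg)
{
  unsigned h = atomic_load_explicit(&head, memory_order_relaxed);
  unsigned t = atomic_load_explicit(&tail, memory_order_acquire);
  if (h - t >= RING_SLOTS)
    return 0;                                   /* full: drop              */
  strncpy(ring[h % RING_SLOTS], msg, MSG_BYTES - 1);
  ring[h % RING_SLOTS][MSG_BYTES - 1] = '\0';
  atomic_store_explicit(&head, h + 1, memory_order_release);
  return 1;
}

/* Drainer thread body: write anything pending to stderr at its own pace. */
static void ring_drain_once(void)
{
  unsigned t = atomic_load_explicit(&tail, memory_order_relaxed);
  unsigned h = atomic_load_explicit(&head, memory_order_acquire);
  while (t != h) {
    const char *msg = ring[t % RING_SLOTS];
    write(STDERR_FILENO, msg, strlen(msg));
    t++;
  }
  atomic_store_explicit(&tail, t, memory_order_release);
}
```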
Anyway, the unstable output as in !4787 is blocking my progress; it's meaningless to continue before we can get stable, meaningful stack traces from DWARF. I really need some hints to get over that first ...
I think a good first step is to print the threadId in each frame
maybe an RTS option controlled in-process thread, to keep dumping the payload to stderr.
On my mind now is something like a broadcast TChan, for simplicity, instead of a real circular buffer consisting of binary blocks to be filled with bytes. But C doesn't have a garbage collector, so the desirable feature of a broadcast TChan, i.e. that incoming items simply get discarded when no receiver is attached, won't work right away. It seems a dedicated thread is unavoidable, at least to free the malloc'ed payload if we're configured not to actually dump to stderr.
Or do you think we could just implement it with a real broadcast TChan, then add a Haskell API to dup it and run a read-print loop? (But I'm not sure how safe it is to atomically write to a TChan from a signal handler.)
```c
/* Return *PC (program counter) for thread-specific frame STATE.
   Set *ISACTIVATION according to DWARF frame "activation" definition.
   Typically you need to subtract 1 from *PC if *ACTIVATION is false to safely
   find function of the caller.  ACTIVATION may be NULL.  PC must not be NULL.
   Function returns false if it failed to find *PC.  */
bool dwfl_frame_pc (Dwfl_Frame *state, Dwarf_Addr *pc, bool *isactivation)
  __nonnull_attribute__ (1, 2);
```
Typically you need to subtract 1 from *PC if *ACTIVATION is false to safely ...
I understand this to imply if (!is_activation) pc -= 1; but it seems the contrary is done:
Is it a bug, or should my understanding be corrected?
For the record, I don't really understand libdw's semantics around ACTIVATION; I just interpreted the sentence literally. And TBH I'm really confused by the subtract-1 thing. I'm no expert at that low a level, but gut feeling tells me function entry addresses should be aligned to 8 bytes or so; if that 1 means a 1-byte offset, then what the hell is it for?
Yes, it looks like a bug to me. In C, adding (or subtracting) 1 to a pointer changes the value by the size of the pointed-to type. It works the same as with arrays, so (p + 1) == &p[1]. Also (uint64_t)p + sizeof(*p) == (uint64_t)(p + 1).
Having said that, all backtraces will have a set_initial_registers frame at the top, so I'd be surprised if this bug is causing issues here.
Yes, I can confirm it's a bug by referencing libdw's print_frames(), but fixing it didn't change the symptom in any apparent way. I then found another quirk to work around (!4787 (ddc679a5)), and the backtraces can now be obtained stably.
So a common way of handling asynchronous signals in a multi-threaded program is to mask signals in all the threads, and then create a separate thread (or threads) whose sole purpose is to catch signals and handle them. The signal-handler thread catches signals by calling the function sigwait() with details of the signals it wishes to wait for.
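A minimal sketch of that pattern: block the signal before spawning any thread (so every thread inherits the mask), then let one dedicated thread wait for it synchronously with sigwait():

```c
/* Dedicated signal-handling thread: SIGQUIT is blocked in every thread,
 * and one thread waits for it with sigwait(), so handling happens in
 * ordinary thread context rather than inside an async signal handler.   */
#include <pthread.h>
#include <signal.h>
#include <stdio.h>

static void *signal_waiter(void *arg)
{
  sigset_t *set = arg;
  for (;;) {
    int sig;
    if (sigwait(set, &sig) == 0 && sig == SIGQUIT) {
      /* Safe to do ordinary I/O here: we are not in a signal handler.   */
      fprintf(stderr, "SIGQUIT received; dumping thread state...\n");
      /* ... walk threads / request backtraces ...                       */
    }
  }
  return NULL;
}

int main(void)
{
  static sigset_t set;
  sigemptyset(&set);
  sigaddset(&set, SIGQUIT);
  /* Block SIGQUIT before creating threads so all threads inherit the mask. */
  pthread_sigmask(SIG_BLOCK, &set, NULL);

  pthread_t tid;
  pthread_create(&tid, NULL, signal_waiter, &set);

  /* ... rest of the program ... */
  pthread_join(tid, NULL);
  return 0;
}
```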
@bgamari I looked at how the JVM has implemented it; it turns out they do use a dedicated signal-handling thread in the fashion described in the article above.
```cpp
case SIGBREAK: {
#if INCLUDE_SERVICES
  ...
#endif
      // Print stack traces
      // Any SIGBREAK operations added here should make sure to flush
      // the output stream (e.g. tty->flush()) after output.  See 4803766.
      // Each module also prints an extra carriage return after its output.
      VM_PrintThreads op;
      VMThread::execute(&op);
      ...
```
The actual dumping work for the VM_PrintThreads operation is done here:
```cpp
// Threads::print_on() is called at safepoint by VM_PrintThreads operation.
void Threads::print_on(outputStream* st, bool print_stacks,
                       bool internal_format, bool print_concurrent_locks,
                       bool print_extended_info) {
  ...
  ALL_JAVA_THREADS(p) {
    ResourceMark rm;
    p->print_on(st, print_extended_info);
    if (print_stacks) {
      if (internal_format) {
        p->trace_stack();
      } else {
        p->print_stack_on(st);
      }
    }
    st->cr();
#if INCLUDE_SERVICES
    ...
#endif // INCLUDE_SERVICES
  }
  ...
```
I believe that by safepoint they mean the world is stopped with all Java threads pending, so all threads' stacks can be observed and dumped. Nevertheless, when the outputStream stalls due to back pressure, I believe they let the whole world wait.
Do we have a counterpart concept/mechanism in GHC/RTS to the JVM's safepoint?
If we can afford a dedicated thread to receive and handle/dispatch signals, that thread would be the perfect place to do the log-dumping work safely with respect to back pressure from stderr, in response to SIGQUIT. That's just what the JVM has been doing all these years.
Btw, they alias SIGBREAK to SIGQUIT like this:
```cpp
// SIGBREAK is sent by the keyboard to query the VM state
#ifndef SIGBREAK
#define SIGBREAK SIGQUIT
#endif
```
Do we have a counterpart concept/mechanism in GHC/RTS to the JVM's safepoint?
We do. GHC "capabilities" (which can be thought of as OS threads running Haskell code) can be interrupted by starting a "sync" event. This is, for instance, how garbage collection is initiated.
I think the best model to follow here is that of the non-moving GC's remembered set flush. Specifically, the non-moving collector triggers all capabilities to synchronize with the GC by way of a flush. The relevant moving parts are:
- nonmovingBeginFlush, which calls stopAllCapabilitiesWith(..., SYNC_FLUSH_UPD_REM_SET)
- stopAllCapabilities sets the interrupt flag for each capability
- when a capability hits a yield point it sees that the interrupt flag is set and enters the scheduler, which eventually enters yieldCapability
- yieldCapability looks at the sync type and does the appropriate thing
- execution continues
To implement backtracing I would simply introduce a new sync type which prints a backtrace. This should be relatively simple. However, it does mean that threads that are stuck in non-allocating loops (which won't encounter a safepoint) won't produce backtraces. Nevertheless, I think this is acceptable; in such a case the user can just whip out gdb.
This is promising. Back when I was tinkering with the codebase, the capability-sync machinery looked very appealing for useful things to be done with it, but as it seems "hard-coded" and connected to many things I can't yet understand, I was frightened off from playing with it.
That said, I'm still worried about which backtraces get printed. As my case has shown, cost-centre stacks appear much more useful: backtraces that hit scheduling or GC code reveal nothing about the misbehaviour of the application code. Such hits, even with a handful of stack frames (program-counter addresses) based on the hardware calling convention, carry little information and are mostly noise for this kind of diagnosis.
I reproduced my situation with 2 threads hogging the CPU, but the most seemingly useful hot spots captured stop at functions like textzm1zi2zi4zi1_DataziText_length_info (libraries/text/src/Data/Text.hs:630.1), (null) (Text/Megaparsec/Internal.hs:317.33), and megaparseczm9zi0zi1zmf1fe707daa2bef8db9d3ff7f9f72aee8df8d35fd20edcae7fcf9b53d5659e4b5_TextziMegaparsecziInternal_zdfFunctorParsecT1_info (Text/Megaparsec/Internal.hs:159.3). I'm not sure whether that's the effect of TCO (tail-call optimization)?
The rest of the captures caught scheduling, GC, and profiling spots, as shown in the attached log.
Please help me see whether anything is wrong with this result; maybe we should resort to Haskell stack frames instead of the hardware-calling-convention stack?
I looked deeper into the code and got this idea: instead, print a backtrace of the CostCentreStack, with something like fprintCCS_stderr(), from each thread:
- Have a global counter, increased per SIGQUIT received (then the signal handler only needs a very safe atomic increment).
- Add a per-thread counter field to struct StgTSO_, which follows the global counter as the thread gets scheduled.
- Perform the follow-up action at each Haskell yield point: when the thread's counter does not match the global counter, print the CostCentreStack of either the just-finished or the next-coming work (or maybe both), then update the thread's counter to match the global counter.
Interleaving of output may be solved the same way, though at the cost that the traces are no longer taken in parallel at the same instant.
That might be really useful for diagnosing CPU hogging as in my situation; do you think it's feasible? A rough sketch of the counter protocol follows.
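This sketch uses portable C11 atomics rather than RTS internals; check_at_yield_point stands in for wherever the RTS would hook this, and printing the real CostCentreStack is of course the part that needs the actual RTS:

```c
/* Epoch-counter sketch: the SIGQUIT handler only bumps a global counter
 * (async-signal-safe); each thread compares its own counter against the
 * global one at its next "yield point" and reports when it is behind.    */
#include <stdatomic.h>
#include <stdio.h>

static atomic_ulong         sigquit_epoch;  /* bumped per SIGQUIT           */
static _Thread_local unsigned long my_epoch; /* per-thread (cf. StgTSO field) */

static void sigquit_handler(int sig)
{
  (void) sig;
  atomic_fetch_add(&sigquit_epoch, 1);      /* the only work in the handler */
}

/* Called at each safe point in a thread (cf. a Haskell yield point).      */
static void check_at_yield_point(void)
{
  unsigned long global = atomic_load(&sigquit_epoch);
  if (my_epoch != global) {
    my_epoch = global;
    /* In the real RTS this would print the thread's CostCentreStack.      */
    fprintf(stderr, "thread behind epoch %lu: reporting current work\n", global);
  }
}
```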
It turns out cost-centre-based backtraces are more informative than the DWARF ones; I tracked down my CPU-hogging bug (excessive STM retries due to misuse).
The shortcomings are apparent, especially that a backtrace won't be printed as soon as Ctrl^\ is pressed unless the thread is busy on the CPU, though that's not a problem in hogging cases. Also, as currently implemented, the backtrace reveals what work will be scheduled after Ctrl^\ is pressed, not what has been keeping the CPU busy at that moment; this is also not much of a problem when the CPU has been hogged for a long time and the culprit code keeps coming back.
It currently depends on a profiling build. I wonder whether the implicit ?callStack parameter could be leveraged for builds without profiling enabled, but I have no idea where to look for that.
Please review MR !4891 and suggest if/how we can go further this way.