[428 of 435] Compiling Agda.Mimer.Mimer
panic! (the 'impossible' happened)
  GHC version 9.10.1: heap overflow
Please report this as a GHC bug: https://www.haskell.org/ghc/reportabug
Hi @AndreasK, I'm a newbie, and this looked like a good place to start contributing. Does this issue have anyone assigned to it at the moment? If not, can I put my hand up for it?
Hmm, I've been looking through rts/Schedule.c. I don't understand all of it, but it seems like there isn't an easy way to prevent a panic in some situations when heap overflow happens.
I'm still figuring out how exception handling is passed between the RTS and the boot libraries, and will keep digging, but I suspect this is going to be a bit tricky.
I agree that the current error message is misleading. Experienced users would probably not file a GHC bug; inexperienced users might, as they probably wouldn't know how to solve their issue. I don't think it will suffice to refer the user to the section of the user guide which documents the heap-controlling flags of the RTS. What action would the user take after referring to that page? I'm assuming the most common cause of the error is a memory leak. At a minimum we probably need to also refer to the profiling section of the user guide. We might want to add something to that section of the user guide, and also something to the section "What to do when something goes wrong".
I think it would be good to have an explicit error message that we agree on before changing the code.
Hi @trac-george.colpitts, In my (still draft) merge request, I have the following error message: "heap overflow: use +RTS -M to increase maximum heap size". Very happy to hear suggestions for improvements!
Even in the case of "unlimited heap", a heap overflow error can occur when swap space is exhausted. In that case your error message probably won't be helpful in fixing the issue. I believe that such an issue is most often due to a memory leak, and the fix involves fixing the memory leak. Thus I think it would be good to also mention this in the error message, and to document the recommended process for fixing a memory leak.
Of course I might be wrong in which case I would be happy to be corrected.
Even if I am right, the proposed error message is better than the current one, and it does resolve the problem of a user setting a heap limit which is too small for their program.
I could file the lack of a documented process for fixing memory leaks as a separate (documentation) bug, if everybody agrees that this is the best resolution, at least for now.
The only document I know of is the ACM Queue article by Neil Mitchell, "Leaking Space", https://queue.acm.org/detail.cfm?id=2538488. Most of that article is about space leaks not memory leaks but it does have a few helpful paragraphs about memory leaks and profiling tools.
Thanks. That made me realize that it would be good to check if the heap size is not unlimited before giving an error message saying to increase it. That still leaves the question of what to say when heap size is unlimited.
"heap overflow: If you have used +RTS -M to set a non-default maximum heap size, then you may need to increase this value. If not, your program may have a memory leak. Consider using profiling tools like valgrind."
I'm in the process of studying this TWEAG post (also with the aim of finding out how best to generate a memory leak to test that the patch in !13209 (closed) works - fyi @mpickering).
UPDATE: @trac-george.colpitts Oh, I just re-read your comment and see that you are suggesting showing a different error message depending on whether a non-default -M value is set.
Your paper is very interesting, thanks for writing it up. To debug the heap overflow, wouldn't it suffice to use heap profiling? It would show the unevaluated thunks. What is the value added of valgrind? Apologies if I missed an explanation of that in your article.
@trac-george.colpitts, I'm very happy to attempt to change the error message conditional on whether a non-default -M flag has been set but, in my enthusiasm to contribute, I dived in without figuring out a good way to reproduce the behaviour described in the issue.
@doyougnu, I used the helpful script from your post, and it causes a memory leak - but the process gets Killed by my OS rather than showing the error @abel reported. But @abel's error occurred when compiling Agda...
Does anyone have a suggestion for how to cause a memory leak during compilation (if that is the way to reproduce this behaviour)?
> Your paper is very interesting, thanks for writing it up. To debug the heap overflow, wouldn't it suffice to use heap profiling? It would show the unevaluated thunks. What is the value added of valgrind?
For heap overflow I would certainly use heap profiling, perhaps with ghc-debug (although I still haven't dug into ghc-debug). I think we already have good tools (such as the eventlog and ticky-ticky) to visualize the heap and track down memory leaks.
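(Since I haven't dug into ghc-debug yet, treat this as an assumption rather than a recipe: my understanding is that the ghc-debug-stub package lets you wrap a program so that a separate ghc-debug client can attach and walk its live heap.)

```haskell
-- Hedged sketch, assuming ghc-debug-stub's withGhcDebug wrapper: the
-- instrumented program listens on the socket named by GHC_DEBUG_SOCKET,
-- and a ghc-debug client can then connect and inspect the heap while
-- this deliberately leaky computation runs.
import GHC.Debug.Stub (withGhcDebug)

main :: IO ()
main = withGhcDebug $
  print (foldl (+) 0 [1 .. (10 ^ 8) :: Integer])
```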
The value added by cachegrind in particular is information about the cache behaviour of the program: if your program has some hot loop, and in that hot loop you're accidentally evicting a lot of the data cache to fetch data from the heap, this won't show up in allocations, in a ticky-ticky report, or in a normal profile, but it will show up in runtime and in latency between subsystems of your program, and your end users will experience it.
Another benefit is inspecting the instruction cache, which can also suffer pipeline hazards if the assembly instructions required to execute a chunk of code exceed the size of the instruction cache for the CPU. This sort of thing is important but is usually not where you want to begin optimizing, since it's particular to your exact machine. For example, in this chapter of HoH I show, using perf (but you could also use cachegrind), that for a specific nofib benchmark, on my system, tables-next-to-code is causing a lot of L1 instruction cache misses that originate from bignum. Now, if this program was what my company depended on to make money, and if my exact hardware is what I was going to run that program on, then I would begin optimizing to reduce these L1 instruction cache misses. But in general, this kind of optimization is not worth the time and effort.
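To make the data-cache point concrete, here's a toy of my own (not the HoH example): a strided walk over a large unboxed array. An allocation profile looks unremarkable, but running the compiled binary under `valgrind --tool=cachegrind` should show a high D1 miss rate, which is exactly the signal cachegrind adds.

```haskell
-- Strided array sum: consecutive accesses are 4096 Ints (~32 KB) apart,
-- so almost every read misses the L1 data cache. The cost shows up in
-- runtime, not in an allocation profile.
import Data.Array.Unboxed

main :: IO ()
main = print (sumStrided arr)
  where
    n = 10000000 :: Int
    arr = listArray (0, n - 1) [0 ..] :: UArray Int Int
    sumStrided a =
      sum [ a ! i | j <- [0 .. 4095], i <- [j, j + 4096 .. n - 1] ]
```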
> Does anyone have a suggestion for how to cause a memory leak during compilation (if that is the way to reproduce this behaviour)?
I think this should be a separate ticket from this one. Something like "ghc-blah-blah exhausts heap when compiling agda-blah-blah", unless that already exists.
Then here is what I would do:
1. checkout the release branch of whatever GHC version was used
2. build a profiled, ticky'd GHC
3. checkout the Agda commit that failed
4. figure out what to pass to `cabal -w <path to the profiled ghc> --allow-newer blah` to recreate the invocation the CI used, which was: `time stack build Agda --no-haddock --test --no-run-tests --flag Agda:optimise-heavily --no-library-profiling --flag Agda:debug --flag Agda:enable-cluster-counting --ghc-options "+RTS -A128M -M4G -RTS"`
Then, once you can get this to build, I would alter the ghc-options to pass eventlog and ticky flags, which would generate the reports. You should still be able to read the eventlog file and see a heap profile (or speedscope profile) even if the process gets killed.
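For concreteness, the flag tweak for step 4 might look like the following (my guess at a flag set, not a tested invocation; `-hT` gives a heap profile by closure type without needing a profiled build, and `-l` writes the eventlog):

```haskell
-- Not a Haskell program, just the flags sketched as a comment:
--
--   --ghc-options "-eventlog +RTS -A128M -M4G -l -hT -RTS"
--
-- RTS -l streams events (including the -hT heap samples) to an
-- .eventlog file, which survives the process being killed and can be
-- rendered afterwards, e.g. with eventlog2html.
```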
This looks interesting @doyougnu - I'm going to finish up my MR for this current issue, then follow your suggestions to explore memory usage during Agda compilation. Your point 4 is really relevant to this issue: it confirms that a -M flag was set via --ghc-options "+RTS ... -RTS" - which explains the behaviour I see in my comment below.
If I understand things correctly, then you can only get a heap overflow when using -M or similar flags. If you exhaust the system's memory, then the OS or the OOM killer will kill your process with a signal, rather than the RTS heap overflow error being triggered. As it stands, the RTS is not aware of the size of the remaining system memory, etc.
So I think suggesting to increase the -M flag is correct.
@ignatiusm The OP happened with insufficient heap caused by a too-low -M value.
This should be easy to reproduce with even a small test case. Just find a value for -M that is low enough.
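For instance (a sketch, untested here): the classic lazy-accumulator leak, compiled without optimisation so the strictness analyser doesn't rescue it, should hit the RTS heap-overflow exit under a small -M:

```haskell
-- Leak.hs: build with `ghc -rtsopts Leak.hs` (no -O), then run with
-- `./Leak +RTS -M16m -RTS`. The lazy foldl piles up ~10^8 pending (+)
-- thunks before anything is forced, blowing through the 16 MB limit.
module Main where

main :: IO ()
main = print (foldl (+) 0 [1 .. (10 ^ 8) :: Integer])
```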
Thanks for the suggestion @abel - Interestingly, just setting a low -M value causes a different message, from rts/hooks/OutOfHeap.c (see screenshot below and comment in my draft MR). Hence I'm puzzled about how to replicate the initial issue.
For reference, my heap-overflow.hs is just a copy of the memory-leak code from @doyougnu's TWEAG post.