I could not come up with a good title, so please bear with me.
I had been using XMonad with GHC 9.0.2 without problems, but after updating to GHC 9.2.2 it crashes often.
Sometimes it leaves the following message; at other times it crashes silently:
xmonad-x86_64-linux: internal error: evacuate: strange closure type 0 (GHC version 9.2.2 for x86_64_unknown_linux) Please report this as a GHC bug: https://www.haskell.org/ghc/reportabug
Aborted (core dumped)
(The closure type varies from time to time.)
This happens even with the most bare-bones XMonad setup, and I am at a loss as to how to slim it down further.
I am reporting this to GHC since it happens specifically with GHC 9.2.2; all previous versions work fine.
Any directions on how to diagnose which part is causing the issue?
Steps to reproduce
Install XMonad as dynamically linked
Run with the default configuration
(At least on my end) performing some action leads to a crash from time to time
Expected behavior
internal error: evacuate: strange closure type 0 should not happen.
Environment
GHC version used: 9.2.2
Optional:
Operating System: Ubuntu 20.04.4 LTS, 64-bit, Linux
System Architecture: x86-64
Oh dear, this indeed sounds quite bad. It's possible that this is a memory-safety issue in some library, but without a reproducer it is hard to say with certainty. How frequently does this crash occur? I would suggest the following:
recompile with debug information enabled (using cabal's --enable-debug-info=3 flag)
recompile against the debug RTS (using GHC's -debug flag; e.g. ghc-options: -debug in your .cabal file)
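Concretely, that might look something like the following (a sketch assuming a cabal-based xmonad build; adjust the package name to match your setup):

```
-- cabal.project.local: link the affected package against GHC's debug RTS
package xmonad
  ghc-options: -debug
```

followed by rebuilding with `cabal build --enable-debug-info=3` to also get DWARF debug information.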
Note the curious absence of a "strange closure type" error. They do look very similar to the ones posted in the xmonad issue though.
The crash always happens when doing something involving xft fonts—which, I guess, the core dump also indicates. This means that, in particular, I can't reproduce it with a bare xmonad config. So far, I've only managed to make it crash with my personal config, which uses xft fonts for e.g. the prompt from xmonad-contrib. It also feels very random: I can't reliably reproduce it with any certain sequence of keys.
Funnily enough, even though I really tried, I can't reproduce it at all once I start compiling with debugging information or run xmonad under rr. I once had this before—another program that would crash but then not anymore when run with strace—and that time it was related to a race condition that wasn't triggered, since the debugging information made the execution just that tiny bit slower.
Sorry if none of this was particularly helpful. I will keep trying for a minimal reproducer.
I tried (and can reproduce this) with both 9.2.3 and 9.2.2. It was definitely fine with 9.0.x and below, so I suppose I could try to bisect this (would be a long weekend though, that's for sure, especially with me still not having a reliable reproducer). I can also try with HEAD at some point, if that'd be helpful.
I have reproduced this with bgamari/t21708-repro>. The trick appears to be building with -threaded and using +RTS -N > 1. In particular, things crash quite reliably with -N8. The problem appears to be that XNextEvent is trampling on live heap objects.
The problem appears to be similar to that seen in #14346 (closed). Specifically, the main loop of xmonad is [XMonad.Main.launch](https://github.com/xmonad/xmonad/blob/9189d002dd8ed369b822b10dcaae4bb66d068670/src/XMonad/Main.hs#L233), which allocates a pinned `ByteArray#` buffer (to hold an X11 event) and then repeatedly calls `XNextEvent` to fill it in a `forever` loop.
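For readers unfamiliar with the pattern, the shape of that loop is roughly the following (a minimal sketch, not xmonad's actual code; `fakeNextEvent` is a stand-in for the real `XNextEvent` FFI call, and the buffer is pinned via `allocaBytes`):

```haskell
module Main where

import Control.Monad (replicateM)
import Data.Word (Word8)
import Foreign.Marshal.Alloc (allocaBytes)  -- pinned allocation, live for the callback's scope
import Foreign.Ptr (Ptr)
import Foreign.Storable (peekByteOff, pokeByteOff)

xEventSize :: Int
xEventSize = 192  -- sizeof(XEvent) on x86-64, matching the Cmm dump below

-- Stand-in for XNextEvent: writes an event-type tag into the buffer.
fakeNextEvent :: Ptr Word8 -> IO ()
fakeNextEvent ev = pokeByteOff ev 0 (22 :: Word8)

-- xmonad uses `forever` here; we iterate a fixed number of times so the
-- sketch terminates, reading back the tag after each "event".
runLoop :: Int -> IO [Word8]
runLoop n =
  allocaBytes xEventSize $ \ev ->
    replicateM n $ do
      fakeNextEvent ev
      peekByteOff ev 0

main :: IO ()
main = runLoop 3 >>= print
```

The crucial point is that the buffer must stay alive (from the GC's perspective) across every foreign call that writes into it.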
Intriguingly, it appears that this event buffer is being inappropriately collected by the GC. The buffer is here:
Note the `0x42023063e0` pointer in the `launch1_info` stack frame; this is the address of the `ByteArray#`'s payload, not the `ByteArray#` itself. Consequently, at this point there is nothing keeping the `ByteArray#` alive. This is quite strange, since if we look at the simplified Core of `launch1` we see that all uses of the `ByteArray#` are enclosed in a `keepAlive#`. Moreover, looking at the STG we see that this `keepAlive#` was correctly lowered to a `touch#` on the `ByteArray#`. I suppose the code generator's free-variable analysis must somehow be concluding that the `ByteArray#` isn't reachable despite the `touch#`.
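For reference, the `touch#`-based lowering of `keepAlive# x s k` amounts to roughly the following shape (a hand-written sketch of the lowering, not GHC's actual generated code):

```haskell
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}
module Main where

import GHC.Exts (touch#)
import GHC.IO (IO (..))

-- Run the continuation, then touch# the kept-alive value, so the GC
-- considers it live until the continuation has finished.
withAlive :: a -> IO r -> IO r
withAlive x (IO k) = IO $ \s0 ->
  case k s0 of
    (# s1, r #) -> case touch# x s1 of
      s2 -> (# s2, r #)

main :: IO ()
main = withAlive "kept-alive" (pure (40 + 2 :: Int)) >>= print
```

Since `touch#` is a no-op at runtime, its only purpose is to appear as a use of `x` in the liveness analysis; if it vanishes before code generation, so does the liveness.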
`xmonadzm0zi17zi0zi9zminplace_XMonadziMain_launch1_info+7241` appears to be the block `chON`, which is the continuation of the call to `prehandle`, with the following stack layout:
Well, this is curious: the code after the allocation of the `ByteArray#` in question is:
```
cR5k:
  // ...
  R1 = 192;   // CmmAssign  <---- this is indeed the right size for an XEvent
  // ...
  call stg_newPinnedByteArray#(R1) returns to cR53, args: 8, res: 8, upd: 8;   // CmmCall
cR53:
  unwind Sp = Just Sp + 48;   // CmmUnwind  <---- at this point R1 is the ByteArray#
  // case binder: ds61_sPi9
  // case binder: ds62_sPic
  // case binder: ev_sPif
  _sPeV::I64 = I64[Sp + 16];   // CmmAssign
  _sPi5::P64 = P64[Sp + 32];   // CmmAssign
  _sPif::I64 = R1 + 16;   // CmmAssign  <--- this is the byteArrayContents result
  I32^[_sPif::I64] = 22 :: W32;   // CmmStore
  // case binder: s2_sPig
  // case binder: sat_sPih
  I64^[_sPif::I64 + 32] = _sPeV::I64;   // CmmStore
  // case binder: s3_sPii
  // case binder: sat_sPij
  I64^[_sPif::I64 + 40] = _sPeV::I64;   // CmmStore
  // case binder: s4_sPik
  I64[Sp - 8] = cR5q;   // CmmStore  <--- Here we start constructing a continuation stack frame
  _sPie::P64 = R1;   // CmmAssign  <--- _sPie is the ByteArray#
  R1 = _sPi5::P64;   // CmmAssign
  I64[Sp] = _sPif::I64;   // CmmStore  <--- here we are pushing a continuation
  P64[Sp + 32] = _sPie::P64;   // CmmStore  <--- Wait, why Sp + 32? After all, [Sp..Sp+7] should be the previous field
  Sp = Sp - 8;   // CmmAssign
  unwind Sp = Just Sp + 56;   // CmmUnwind
  if (R1 & 7 != 0) goto cR5q; else goto cR5r;   // CmmCondBranch
```
It turns out that the above was a red herring. However, I do now see the problem. In particular, I noticed that in the final STG there are six occurrences of `touch#`. However, in the Cmm from STG there are only five occurrences of `MO_Touch`. After careful inspection of the code, I believe the STG of the dropped occurrence is:
Okay, I finally have a vague sense of what is going on here. Specifically, the code-generation logic for handling the `sPAA` case is (with some added tracing):
```haskell
cgCase scrut bndr alt_type alts
  = -- the general case
    do { platform <- getPlatform
       ; up_hp_usg <- getVirtHp        -- Upstream heap usage
       ; let ret_bndrs = chooseReturnBndrs bndr alt_type alts
             alt_regs  = map (idToReg platform) ret_bndrs
       ; simple_scrut <- isSimpleScrut scrut alt_type
       ; let do_gc  | is_cmp_op scrut  = False  -- See Note [GC for conditionals]
                    | not simple_scrut = True
                    | isSingleton alts = False
                    | up_hp_usg > 0    = False
                    | otherwise        = True
               -- cf Note [Compiling case expressions]
             gc_plan = if do_gc then GcInAlts alt_regs else NoGcInAlts

       ; mb_cc <- maybeSaveCostCentre simple_scrut
       ; let sequel = AssignTo alt_regs do_gc {- Note [scrut sequel] -}
       ; emitCommentDoc $ text "cgCase(scrut)" <+> ppr bndr  -- <--- we make it here
       ; ret_kind <- withSequel sequel (cgExpr scrut)
       ; emitCommentDoc $ text "case binder:" <+> ppr bndr   -- <--- this is never emitted
       ; restoreCurrentCostCentre mb_cc
       ; _ <- bindArgsToRegs ret_bndrs
       ; cgAlts (gc_plan, ret_kind) (NonVoid bndr) alt_type alts
       }
```
Intriguingly, I see the first trace (namely, `cgCase(scrut) sat_sPAA`) in the resulting Cmm, yet not the second. This is quite mysterious. My vague sense is that this may be because the scrutinee of the case never converges (e.g. it is an infinitely looping let-no-escape). However, I still don't quite see how this would result in the code generator not emitting `sPAA`'s continuation. After all, the `FCode` monad includes no explicit mechanism for early return.
Until now I have been utterly perplexed by the fact that the code generator was apparently failing to emit a continuation for the `sat_sPAA` case above. I spent quite a long time scratching my head at this, to the point of questioning whether the compiler itself had been miscompiled. However, it turns out that the code generator is producing Cmm for this continuation; it is the `Outputable` instance for `CmmGraph` (which preprocesses the graph with `revPostorder`) which drops the continuation. The fact that the `Outputable` instance can misrepresent the Cmm in this way is quite surprising and really should be changed, IMHO (e.g. we could print all reachable blocks in post-order and then print all remaining blocks at the end).
The continuation is dropped because it is in fact dead code: the scrutinee of the `sat_sPAA` case is an infinitely looping join point. Consequently, even if the code generator generates code for the `touch#` continuation, it will promptly be dropped by the Cmm control-flow optimisation pass, which runs before the stack-layout pass that would manifest the `touch#` call as a stack frame.
Here is a minimal reproducer which indeed appears to circumvent keepAlive#:
```haskell
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}
module Hi where

import GHC.IO
import GHC.Exts
import Control.Monad

f :: a -> IO ()
f x = IO (\s0 -> keepAlive# x s0 (let y = y in y))
```
Somewhat surprisingly, `IO (\s0 -> keepAlive# x s0 (unIO $ forever $ return ()))` doesn't reproduce the issue, despite the fact that the original xmonad reproducer uses `forever`.
So, the question is what should be done about this. I have long been wary of the current implementation of `keepAlive#`, which desugars to `touch#` during CorePrep (Option B as described on the wiki page). At the time I thought that this would be safe on account of the very minimal optimisation that we perform in the STG pipeline. In hindsight, this was quite naive: as we see here, join points can produce programs with unreachable blocks, which the Cmm pipeline is perfectly capable of dropping.
I believe that the best way around this is to propagate `keepAlive#` down through STG and handle it explicitly in the code generator by pushing a stack frame (Option C). I do vaguely recall attempting this when I last worked on this and finding that it was not straightforward, but sadly I think we have no alternative.
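For what it's worth, the rough shape of Option C in the code generator might be something like this (purely hypothetical pseudocode; none of these names exist in GHC today):

```haskell
-- Hypothetical pseudocode, not GHC source: keepAlive# survives into STG as a
-- first-class construct, and the code generator pushes a dedicated stack
-- frame whose only job is to act as a GC root for the kept-alive object.
cgExpr (StgKeepAlive x cont) = do
  pushKeepAliveFrame x   -- frame layout marks x as a live pointer
  cgExpr cont            -- even if cont is an infinite loop or its
                         -- continuation is later dropped as dead code,
                         -- the frame itself keeps x rooted
```

The key property is that the rooting no longer depends on a `touch#` occurrence surviving every downstream optimisation pass.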