I could not come up with a good title, so please bear with me.
I had been using XMonad with GHC 9.0.2 without problems, but after updating to GHC 9.2.2 it crashes often.
Sometimes it leaves the following message; at other times it crashes silently:
xmonad-x86_64-linux: internal error: evacuate: strange closure type 0 (GHC version 9.2.2 for x86_64_unknown_linux) Please report this as a GHC bug: https://www.haskell.org/ghc/reportabug
Aborted (core dumped)
(The closure type varies from time to time.)
This happens even with the most bare-bones XMonad setup, and I am at a loss as to how to slim it down further.
I am reporting this to GHC since it happens specifically with GHC 9.2.2; all previous versions work fine.
Any directions on how to diagnose which part is causing the issue?
Steps to reproduce
Install XMonad as dynamically linked
Run with the default configuration
(At least on my end) performing some action leads to a crash from time to time
Expected behavior
internal error: evacuate: strange closure type 0 should not happen.
Environment
GHC version used: 9.2.2
Optional:
Operating System: Ubuntu 20.04.4 LTS, 64-bit, Linux
System Architecture: x86-64
Oh dear, this indeed sounds quite bad. It's possible that this is a memory-safety issue in some library, but without a reproducer it is hard to say with certainty. How frequently does this crash occur? I would suggest the following:
recompile with debug information enabled (using cabal's --enable-debug-info=3 flag)
recompile against the debug RTS (using GHC's -debug flag; e.g. ghc-options: -debug in your .cabal file)
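Concretely, that might look something like the following (a sketch assuming a cabal-based xmonad build; adjust the package name to match your setup):

```
-- cabal.project.local: link the affected package against GHC's debug RTS
package xmonad
  ghc-options: -debug
```

followed by rebuilding with `cabal build --enable-debug-info=3` to also get DWARF debug information.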
Note the curious absence of a "strange closure type" error. They do look very similar to the ones posted in the xmonad issue though.
The crash always happens when doing something involving xft fonts—which, I guess, the core dump also indicates. This means that, in particular, I can't reproduce it with a bare xmonad config. So far, I've only managed to make it crash with my personal config, which uses xft fonts for e.g. the prompt from xmonad-contrib. It also feels very random: I can't reliably reproduce it with any certain sequence of keys.
Funnily enough, even though I really tried, I can't reproduce it at all once I start compiling with debugging information or run xmonad under rr. I once had this before—another program that would crash but then not anymore when run with strace—and that time it was related to a race condition that wasn't triggered, since the debugging information made the execution just that tiny bit slower.
Sorry if none of this was particularly helpful. I will keep trying for a minimal reproducer.
I tried (and can reproduce this) with both 9.2.3 and 9.2.2. It was definitely fine with 9.0.x and below, so I suppose I could try to bisect this (would be a long weekend though, that's for sure, especially with me still not having a reliable reproducer). I can also try with HEAD at some point, if that'd be helpful.
I have reproduced this with bgamari/t21708-repro>. The trick appears to be building with -threaded and using +RTS -N > 1. In particular, things crash quite reliably with -N8. The problem appears to be that XNextEvent is trampling on live heap objects.
The problem appears to be similar to that seen in #14346 (closed). Specifically, the main loop of xmonad is [XMonad.Main.launch](https://github.com/xmonad/xmonad/blob/9189d002dd8ed369b822b10dcaae4bb66d068670/src/XMonad/Main.hs#L233), which allocates a pinned `ByteArray#` buffer (to hold an X11 event) and then repeatedly calls `XNextEvent` to fill it in a `forever` loop.
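For readers unfamiliar with the pattern, the shape of that loop is roughly the following (a minimal sketch, not xmonad's actual code; `fakeNextEvent` is a stand-in for the real `XNextEvent` FFI call, and the buffer is pinned via `allocaBytes`):

```haskell
module Main where

import Control.Monad (replicateM)
import Data.Word (Word8)
import Foreign.Marshal.Alloc (allocaBytes)  -- pinned allocation, live for the callback's scope
import Foreign.Ptr (Ptr)
import Foreign.Storable (peekByteOff, pokeByteOff)

xEventSize :: Int
xEventSize = 192  -- sizeof(XEvent) on x86-64, matching the Cmm dump below

-- Stand-in for XNextEvent: writes an event-type tag into the buffer.
fakeNextEvent :: Ptr Word8 -> IO ()
fakeNextEvent ev = pokeByteOff ev 0 (22 :: Word8)

-- xmonad uses `forever` here; we iterate a fixed number of times so the
-- sketch terminates, reading back the tag after each "event".
runLoop :: Int -> IO [Word8]
runLoop n =
  allocaBytes xEventSize $ \ev ->
    replicateM n $ do
      fakeNextEvent ev
      peekByteOff ev 0

main :: IO ()
main = runLoop 3 >>= print
```

The crucial point is that the buffer must stay alive (from the GC's perspective) across every foreign call that writes into it.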
Intriguingly, it appears that this event buffer is being inappropriately collected by the GC. The buffer is here:
Note the `0x42023063e0` pointer in the `launch1_info` stack frame; this is the address of the `ByteArray#`'s payload, not the `ByteArray#` itself. Consequently, at this point there is nothing keeping the `ByteArray#` alive. This is quite strange, since if we look at the simplified Core of `launch1` we see that all uses of the `ByteArray#` are enclosed in a `keepAlive#`. Moreover, looking at the STG we see that this `keepAlive#` was correctly lowered to a `touch#` on the `ByteArray#`. I suppose the code generator's free-variable analysis must somehow be concluding that the `ByteArray#` isn't reachable despite the `touch#`.
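For reference, the `touch#`-based lowering of `keepAlive# x s k` amounts to roughly the following shape (a hand-written sketch of the lowering, not GHC's actual generated code):

```haskell
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}
module Main where

import GHC.Exts (touch#)
import GHC.IO (IO (..))

-- Run the continuation, then touch# the kept-alive value, so the GC
-- considers it live until the continuation has finished.
withAlive :: a -> IO r -> IO r
withAlive x (IO k) = IO $ \s0 ->
  case k s0 of
    (# s1, r #) -> case touch# x s1 of
      s2 -> (# s2, r #)

main :: IO ()
main = withAlive "kept-alive" (pure (40 + 2 :: Int)) >>= print
```

Since `touch#` is a no-op at runtime, its only purpose is to appear as a use of `x` in the liveness analysis; if it vanishes before code generation, so does the liveness.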
`xmonadzm0zi17zi0zi9zminplace_XMonadziMain_launch1_info+7241` appears to be the block `chON`, which is the continuation of the call to `prehandle`, with the following stack layout:
Well, this is curious: the code after the allocation of the `ByteArray#` in question is:
```
cR5k:
  // ...
  R1 = 192;   // CmmAssign  <---- this is indeed the right size for an XEvent
  // ...
  call stg_newPinnedByteArray#(R1) returns to cR53, args: 8, res: 8, upd: 8;   // CmmCall
cR53:
  unwind Sp = Just Sp + 48;   // CmmUnwind  <---- at this point R1 is the ByteArray#
  // case binder: ds61_sPi9
  // case binder: ds62_sPic
  // case binder: ev_sPif
  _sPeV::I64 = I64[Sp + 16];   // CmmAssign
  _sPi5::P64 = P64[Sp + 32];   // CmmAssign
  _sPif::I64 = R1 + 16;   // CmmAssign  <--- this is the byteArrayContents result
  I32^[_sPif::I64] = 22 :: W32;   // CmmStore
  // case binder: s2_sPig
  // case binder: sat_sPih
  I64^[_sPif::I64 + 32] = _sPeV::I64;   // CmmStore
  // case binder: s3_sPii
  // case binder: sat_sPij
  I64^[_sPif::I64 + 40] = _sPeV::I64;   // CmmStore
  // case binder: s4_sPik
  I64[Sp - 8] = cR5q;   // CmmStore  <--- Here we start constructing a continuation stack frame
  _sPie::P64 = R1;   // CmmAssign  <--- _sPie is the ByteArray#
  R1 = _sPi5::P64;   // CmmAssign
  I64[Sp] = _sPif::I64;   // CmmStore  <--- here we are pushing a continuation
  P64[Sp + 32] = _sPie::P64;   // CmmStore  <--- Wait, why Sp + 32? After all, [Sp..Sp+7] should be the previous field
  Sp = Sp - 8;   // CmmAssign
  unwind Sp = Just Sp + 56;   // CmmUnwind
  if (R1 & 7 != 0) goto cR5q; else goto cR5r;   // CmmCondBranch
```
It turns out that the above was a red herring. However, I do now see the problem. In particular, I noticed that in the final STG there are six occurrences of `touch#`. However, in the Cmm from STG there are only five occurrences of `MO_Touch`. After careful inspection of the code, I believe the STG of the dropped occurrence is:
Okay, I finally have a vague sense of what is going on here. Specifically, the code-generation logic for handling the `sPAA` case is (with some added tracing):
```haskell
cgCase scrut bndr alt_type alts
  = -- the general case
    do { platform <- getPlatform
       ; up_hp_usg <- getVirtHp        -- Upstream heap usage
       ; let ret_bndrs = chooseReturnBndrs bndr alt_type alts
             alt_regs  = map (idToReg platform) ret_bndrs
       ; simple_scrut <- isSimpleScrut scrut alt_type
       ; let do_gc  | is_cmp_op scrut  = False  -- See Note [GC for conditionals]
                    | not simple_scrut = True
                    | isSingleton alts = False
                    | up_hp_usg > 0    = False
                    | otherwise        = True
               -- cf Note [Compiling case expressions]
             gc_plan = if do_gc then GcInAlts alt_regs else NoGcInAlts

       ; mb_cc <- maybeSaveCostCentre simple_scrut
       ; let sequel = AssignTo alt_regs do_gc {- Note [scrut sequel] -}
       ; emitCommentDoc $ text "cgCase(scrut)" <+> ppr bndr  -- <--- we make it here
       ; ret_kind <- withSequel sequel (cgExpr scrut)
       ; emitCommentDoc $ text "case binder:" <+> ppr bndr   -- <--- this is never emitted
       ; restoreCurrentCostCentre mb_cc
       ; _ <- bindArgsToRegs ret_bndrs
       ; cgAlts (gc_plan, ret_kind) (NonVoid bndr) alt_type alts
       }
```
Intriguingly, I see the first trace (namely, `cgCase(scrut) sat_sPAA`) in the resulting Cmm, yet not the second. This is quite mysterious. My vague sense is that this may be because the scrutinee of the case never converges (e.g. it is an infinitely looping let-no-escape). However, I still don't quite see how this would result in the code generator not emitting `sPAA`'s continuation. After all, the `FCode` monad includes no explicit mechanism for early return.
Until now I have been utterly perplexed by the fact that the code generator was apparently failing to emit a continuation for the `sat_sPAA` case above. I spent quite a long time scratching my head at this, to the point of questioning whether the compiler itself had been miscompiled. However, it turns out that the code generator is producing Cmm for this continuation; it is the `Outputable` instance for `CmmGraph` (which preprocesses the graph with `revPostorder`) which drops the continuation. The fact that the `Outputable` instance can misrepresent the Cmm in this way is quite surprising and really should be changed, IMHO (e.g. we could print all reachable blocks in post-order and then print all remaining blocks at the end).
The continuation is dropped because it is in fact dead code: the scrutinee of the `sat_sPAA` case is an infinitely looping join point. Consequently, even if the code generator generates code for the `touch#` continuation, it will promptly be dropped by the Cmm control-flow optimisation pass, which runs before the stack-layout pass that would manifest the `touch#` call as a stack frame.
Here is a minimal reproducer which indeed appears to circumvent keepAlive#:
```haskell
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedTuples #-}
module Hi where

import GHC.IO
import GHC.Exts
import Control.Monad

f :: a -> IO ()
f x = IO (\s0 -> keepAlive# x s0 (let y = y in y))
```
Somewhat surprisingly, `IO (\s0 -> keepAlive# x s0 (unIO $ forever $ return ()))` doesn't reproduce the issue, despite the fact that the original xmonad reproducer uses `forever`.
So, the question is what should be done about this. I have long been wary of the current implementation of `keepAlive#`, which desugars to `touch#` during CorePrep (Option B as described on the wiki page). At the time I thought that this would be safe on account of the very minimal optimisation that we perform in the STG pipeline. In hindsight, this was quite naive: as we see here, join points can produce programs with unreachable blocks, which the Cmm pipeline is perfectly capable of dropping.
I believe that the best way around this is to propagate `keepAlive#` down through STG and handle it explicitly in the code generator by pushing a stack frame (Option C). I do vaguely recall attempting this when I last worked on this and finding that it was not straightforward, but sadly I think we have no alternative.
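For what it's worth, the rough shape of Option C in the code generator might be something like this (purely hypothetical pseudocode; none of these names exist in GHC today):

```haskell
-- Hypothetical pseudocode, not GHC source: keepAlive# survives into STG as a
-- first-class construct, and the code generator pushes a dedicated stack
-- frame whose only job is to act as a GC root for the kept-alive object.
cgExpr (StgKeepAlive x cont) = do
  pushKeepAliveFrame x   -- frame layout marks x as a live pointer
  cgExpr cont            -- even if cont is an infinite loop or its
                         -- continuation is later dropped as dead code,
                         -- the frame itself keeps x rooted
```

The key property is that the rooting no longer depends on a `touch#` occurrence surviving every downstream optimisation pass.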