"Internal error: invalid closure" and seg faults with non-moving gc

Summary

I've been testing the non-moving garbage collector against a proprietary app and found that it occasionally leads to segmentation faults/assertion failures. I've done some testing with 8.10.7 and with 9.2, and I've observed the following:

With 8.10.7:

The initial error was: internal error: mark_closure: unimplemented/strange closure type -2003490061 @ 0x4730e283d0
On subsequent runs I got segmentation faults around the same point. So, I recompiled with -debug and -g and tried to reproduce in gdb, but the stack traces weren't very informative. The seg faults tended to occur in memcmp, called from the Eq instance of a type.
One time I got an assertion failure from this line: https://gitlab.haskell.org/ghc/ghc/-/blob/ghc-8.10/rts/Messages.c#L94 . I thought there might have been a fix for this, so I decided to try 9.2 instead.

With 9.2:

The problem showed up much sooner in the application run and seemed to be much more consistent.
Most often I would get the following error: internal error: invalid closure, info=0x4221c1df98. Sometimes the info pointer is an actual address and sometimes it is 0. I get stack traces that look like this:

#0  0x00007ffff7a8c33a in raise () from /nix/store/ikl21vjfq900ccbqg1xasp83kadw6q8y-glibc-2.32-46/lib/libc.so.6
#1  0x00007ffff7a76523 in abort () from /nix/store/ikl21vjfq900ccbqg1xasp83kadw6q8y-glibc-2.32-46/lib/libc.so.6
#2  0x000000000b4e3e24 in rtsFatalInternalErrorFn (s=0x10094275 "invalid closure, info=%p", ap=0x7fffc4ff82e8) at rts/RtsMessages.c:186
#3  0x000000000b4e3a56 in barf (s=0x10094275 "invalid closure, info=%p") at rts/RtsMessages.c:48
#4  0x000000000b55660a in evacuate (p=0x42232aa888) at rts/sm/Evac.c:693
#5  0x000000000b52a718 in scavenge_block (bd=0x4223202a80) at rts/sm/Scav.c:601
#6  0x000000000b52c85b in scavenge_find_work () at rts/sm/Scav.c:2102
#7  0x000000000b52c995 in scavenge_loop () at rts/sm/Scav.c:2178
#8  0x000000000b521267 in scavenge_until_all_done () at rts/sm/GC.c:1293
#9  0x000000000b5214f9 in gcWorkerThread (cap=0x17b33b20) at rts/sm/GC.c:1381
#10 0x000000000b4efeb2 in yieldCapability (pCap=0x7fffc4ff8d60, task=0x7fffc8000bb0, gcAllowed=true) at rts/Capability.c:971
#11 0x000000000b4fde04 in scheduleYield (pcap=0x7fffc4ff8d90, task=0x7fffc8000bb0) at rts/Schedule.c:705
#12 0x000000000b4fd0af in schedule (initialCapability=0x17b33b20, task=0x7fffc8000bb0) at rts/Schedule.c:315
#13 0x000000000b500da6 in scheduleWorker (cap=0x17b33b20, task=0x7fffc8000bb0) at rts/Schedule.c:2671
#14 0x000000000b4f821d in workerStart (task=0x7fffc8000bb0) at rts/Task.c:445
#15 0x00007ffff7fb1e9e in start_thread () from /nix/store/ikl21vjfq900ccbqg1xasp83kadw6q8y-glibc-2.32-46/lib/libpthread.so.0
#16 0x00007ffff7b4b49f in clone () from /nix/store/ikl21vjfq900ccbqg1xasp83kadw6q8y-glibc-2.32-46/lib/libc.so.6

I was eventually able to reproduce the problem with rr. In that specific instance the failure message is: internal error: invalid closure, info=(nil) and the stack trace is:

#0  0x00007f2c1ea6d33a in raise () from /nix/store/ikl21vjfq900ccbqg1xasp83kadw6q8y-glibc-2.32-46/lib/libc.so.6
#1  0x00007f2c1ea57523 in abort () from /nix/store/ikl21vjfq900ccbqg1xasp83kadw6q8y-glibc-2.32-46/lib/libc.so.6
#2  0x000000000b4f1cb4 in rtsFatalInternalErrorFn (s=0x1009ed35 "invalid closure, info=%p", ap=0x7f2be99f9f78) at rts/RtsMessages.c:186
#3  0x000000000b4f18e6 in barf (s=0x1009ed35 "invalid closure, info=%p") at rts/RtsMessages.c:48
#4  0x000000000b56449a in evacuate (p=0x4200072598) at rts/sm/Evac.c:693
#5  0x000000000b55df21 in nonmovingScavengeOne (q=0x4200072590) at rts/sm/NonMovingScav.c:351
#6  0x000000000b55e0b2 in scavengeNonmovingSegment (seg=0x4200070000) at rts/sm/NonMovingScav.c:397
#7  0x000000000b53a6ae in scavenge_find_work () at rts/sm/Scav.c:2091
#8  0x000000000b53a825 in scavenge_loop () at rts/sm/Scav.c:2178
#9  0x000000000b52f0f7 in scavenge_until_all_done () at rts/sm/GC.c:1293
#10 0x000000000b52d538 in GarbageCollect (collect_gen=0, do_heap_census=false, is_overflow_gc=true, deadlock_detect=false, gc_type=2, cap=0x1895d070, idle_cap=0x7f2c000015b0)
    at rts/sm/GC.c:538
#11 0x000000000b50d71f in scheduleDoGC (pcap=0x7f2be99fad90, task=0x7f2bf0000bb0, force_major=false, is_overflow_gc=true, deadlock_detect=false) at rts/Schedule.c:1886
#12 0x000000000b50b960 in schedule (initialCapability=0x1895d070, task=0x7f2bf0000bb0) at rts/Schedule.c:579
#13 0x000000000b50ec36 in scheduleWorker (cap=0x1895d070, task=0x7f2bf0000bb0) at rts/Schedule.c:2671
#14 0x000000000b5060ad in workerStart (task=0x7f2bf0000bb0) at rts/Task.c:445
#15 0x00007f2c1ef92e9e in start_thread () from /nix/store/ikl21vjfq900ccbqg1xasp83kadw6q8y-glibc-2.32-46/lib/libpthread.so.0
#16 0x00007f2c1eb2c49f in clone () from /nix/store/ikl21vjfq900ccbqg1xasp83kadw6q8y-glibc-2.32-46/lib/libc.so.6

I tried following the problematic closure 0x4200072598 around. It seems to be a CAF that gets forced early on. It is then copied by a GC and isn't touched again until the crash above.
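For readers unfamiliar with the pattern, here is a minimal sketch of what "a CAF that gets forced early on and then isn't touched" looks like in source. It is an illustration only: it is not derived from the affected application, it is not known to trigger the crash, and the names squares and lookupSquare are made up for this example. The build and run commands in the comments are assumptions about a typical setup (threaded RTS, debug RTS, non-moving GC selected with -xn).

```haskell
module Main (main) where

import Control.Exception (evaluate)
import Control.Monad (forM_)
import System.Mem (performMajorGC)

-- Hypothetical example, NOT a reproducer for the crash in this report.
-- Assumed build: ghc -threaded -debug -g -rtsopts Main.hs
-- Assumed run:   ./Main +RTS -xn -N2

-- A top-level CAF: evaluated at most once, then kept alive by the RTS
-- for as long as code that refers to it can still run.
squares :: [Int]
squares = map (\k -> k * k) [1 .. 1000]
{-# NOINLINE squares #-}

-- Read one element of the CAF (1-based index).
lookupSquare :: Int -> Int
lookupSquare k = squares !! (k - 1)

main :: IO ()
main = do
  -- Force the CAF early on, including every element.
  _ <- evaluate (sum squares)
  -- Allocate and run major collections without touching the CAF.
  forM_ [1 :: Int .. 50] $ \i -> do
    _ <- evaluate (sum [1 .. i * 10000])
    performMajorGC
  -- Touch the CAF again only at the very end.
  print (lookupSquare 12) -- prints 144
```

In a healthy runtime the final read simply returns the value computed at startup. The crash described above corresponds to the collector finding an invalid info pointer while scavenging an object of this kind.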

I'll try to get more data. Let me know if anything in particular would be helpful.

Steps to reproduce

Unfortunately I can't share the executable that is causing this crash.

Expected behaviour

Seg faults/assertion errors shouldn't occur.

Environment

GHC version used: The Glorious Glasgow Haskell Compilation System, version 9.2.0.20210910

Optional:

Operating System: Linux
System Architecture: x86_64

Edited Sep 21, 2021 by Teo Camarasu
