"Internal error: invalid closure" and seg faults with non-moving gc
Summary
I've been testing the non-moving garbage collector against a proprietary app and found that it occasionally leads to segmentation faults/assertion failures. I've done some testing with 8.10.7 and with 9.2, and I've observed the following:
With 8.10.7:
- The initial error was:
internal error: mark_closure: unimplemented/strange closure type -2003490061 @ 0x4730e283d0
- On subsequent runs I got segmentation faults around the same point, so I recompiled with -debug and -g and tried to reproduce in gdb, but the stack traces weren't very informative. We tended to get seg faults in memcmp called from the Eq instance of a type.
- One time I got an assertion failure from this line: https://gitlab.haskell.org/ghc/ghc/-/blob/ghc-8.10/rts/Messages.c#L94 . I thought there might have been a fix for this, so I decided to try 9.2 instead.
With 9.2:
- The problem showed up much sooner in the application run and seemed to be much more consistent.
- Most often I would get the following error:
internal error: invalid closure, info=0x4221c1df98
Sometimes the info pointer is an actual address and sometimes it is 0. I get stack traces that look like this:
#0 0x00007ffff7a8c33a in raise () from /nix/store/ikl21vjfq900ccbqg1xasp83kadw6q8y-glibc-2.32-46/lib/libc.so.6
#1 0x00007ffff7a76523 in abort () from /nix/store/ikl21vjfq900ccbqg1xasp83kadw6q8y-glibc-2.32-46/lib/libc.so.6
#2 0x000000000b4e3e24 in rtsFatalInternalErrorFn (s=0x10094275 "invalid closure, info=%p", ap=0x7fffc4ff82e8) at rts/RtsMessages.c:186
#3 0x000000000b4e3a56 in barf (s=0x10094275 "invalid closure, info=%p") at rts/RtsMessages.c:48
#4 0x000000000b55660a in evacuate (p=0x42232aa888) at rts/sm/Evac.c:693
#5 0x000000000b52a718 in scavenge_block (bd=0x4223202a80) at rts/sm/Scav.c:601
#6 0x000000000b52c85b in scavenge_find_work () at rts/sm/Scav.c:2102
#7 0x000000000b52c995 in scavenge_loop () at rts/sm/Scav.c:2178
#8 0x000000000b521267 in scavenge_until_all_done () at rts/sm/GC.c:1293
#9 0x000000000b5214f9 in gcWorkerThread (cap=0x17b33b20) at rts/sm/GC.c:1381
#10 0x000000000b4efeb2 in yieldCapability (pCap=0x7fffc4ff8d60, task=0x7fffc8000bb0, gcAllowed=true) at rts/Capability.c:971
#11 0x000000000b4fde04 in scheduleYield (pcap=0x7fffc4ff8d90, task=0x7fffc8000bb0) at rts/Schedule.c:705
#12 0x000000000b4fd0af in schedule (initialCapability=0x17b33b20, task=0x7fffc8000bb0) at rts/Schedule.c:315
#13 0x000000000b500da6 in scheduleWorker (cap=0x17b33b20, task=0x7fffc8000bb0) at rts/Schedule.c:2671
#14 0x000000000b4f821d in workerStart (task=0x7fffc8000bb0) at rts/Task.c:445
#15 0x00007ffff7fb1e9e in start_thread () from /nix/store/ikl21vjfq900ccbqg1xasp83kadw6q8y-glibc-2.32-46/lib/libpthread.so.0
#16 0x00007ffff7b4b49f in clone () from /nix/store/ikl21vjfq900ccbqg1xasp83kadw6q8y-glibc-2.32-46/lib/libc.so.6
- I was eventually able to reproduce the problem with rr. In that specific instance the failure message is:
internal error: invalid closure, info=(nil)
and the stack trace is:
#0 0x00007f2c1ea6d33a in raise () from /nix/store/ikl21vjfq900ccbqg1xasp83kadw6q8y-glibc-2.32-46/lib/libc.so.6
#1 0x00007f2c1ea57523 in abort () from /nix/store/ikl21vjfq900ccbqg1xasp83kadw6q8y-glibc-2.32-46/lib/libc.so.6
#2 0x000000000b4f1cb4 in rtsFatalInternalErrorFn (s=0x1009ed35 "invalid closure, info=%p", ap=0x7f2be99f9f78) at rts/RtsMessages.c:186
#3 0x000000000b4f18e6 in barf (s=0x1009ed35 "invalid closure, info=%p") at rts/RtsMessages.c:48
#4 0x000000000b56449a in evacuate (p=0x4200072598) at rts/sm/Evac.c:693
#5 0x000000000b55df21 in nonmovingScavengeOne (q=0x4200072590) at rts/sm/NonMovingScav.c:351
#6 0x000000000b55e0b2 in scavengeNonmovingSegment (seg=0x4200070000) at rts/sm/NonMovingScav.c:397
#7 0x000000000b53a6ae in scavenge_find_work () at rts/sm/Scav.c:2091
#8 0x000000000b53a825 in scavenge_loop () at rts/sm/Scav.c:2178
#9 0x000000000b52f0f7 in scavenge_until_all_done () at rts/sm/GC.c:1293
#10 0x000000000b52d538 in GarbageCollect (collect_gen=0, do_heap_census=false, is_overflow_gc=true, deadlock_detect=false, gc_type=2, cap=0x1895d070, idle_cap=0x7f2c000015b0) at rts/sm/GC.c:538
#11 0x000000000b50d71f in scheduleDoGC (pcap=0x7f2be99fad90, task=0x7f2bf0000bb0, force_major=false, is_overflow_gc=true, deadlock_detect=false) at rts/Schedule.c:1886
#12 0x000000000b50b960 in schedule (initialCapability=0x1895d070, task=0x7f2bf0000bb0) at rts/Schedule.c:579
#13 0x000000000b50ec36 in scheduleWorker (cap=0x1895d070, task=0x7f2bf0000bb0) at rts/Schedule.c:2671
#14 0x000000000b5060ad in workerStart (task=0x7f2bf0000bb0) at rts/Task.c:445
#15 0x00007f2c1ef92e9e in start_thread () from /nix/store/ikl21vjfq900ccbqg1xasp83kadw6q8y-glibc-2.32-46/lib/libpthread.so.0
#16 0x00007f2c1eb2c49f in clone () from /nix/store/ikl21vjfq900ccbqg1xasp83kadw6q8y-glibc-2.32-46/lib/libc.so.6
I tried following the problematic closure 0x4200072598 around, and it seems to be a CAF that gets forced early on; it is then copied by a GC and isn't touched again until the crash above.
I'll try to get more data. Let me know if anything in particular would be helpful.
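To make the CAF lifecycle described above concrete, here is a minimal sketch of the pattern: a top-level value that is forced early, survives several collections, and is only read again much later. This is an illustration only and is not a reproducer; the name `table`, its contents, and the number of collections are hypothetical and are not taken from the application that crashes.

```haskell
module Main (main) where

import Control.Exception (evaluate)
import Control.Monad (replicateM_)
import System.Mem (performMajorGC)

-- Hypothetical CAF: a top-level value that takes no arguments.
-- It is evaluated at most once and the result is shared afterwards.
table :: [(Int, String)]
table = [(i, show i) | i <- [1 .. 1000]]
{-# NOINLINE table #-}

main :: IO ()
main = do
  -- Force the spine of the CAF early in the run.
  n <- evaluate (length table)
  print n
  -- Run a few major collections while the CAF is still live but unused.
  replicateM_ 3 performMajorGC
  -- Touch the CAF again much later; in the failing application the crash
  -- happened in the GC before any such later use.
  print (lookup 42 table)
```

Assuming a build along the lines of `ghc -threaded -debug -g -rtsopts Main.hs`, the non-moving collector can be selected at run time with `./Main +RTS --nonmoving-gc`. I would not expect a program this small to trigger the failure; it only shows the shape of the closure involved.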
Steps to reproduce
Unfortunately I can't share the executable that is causing this crash.
Expected behaviour
Seg faults/assertion errors shouldn't occur.
Environment
- GHC version used: The Glorious Glasgow Haskell Compilation System, version 9.2.0.20210910
Optional:
- Operating System: Linux
- System Architecture: x86_64
Edited by Teo Camarasu