Segfault in GC (evacuate1) after a program runs for several days

Summary

We run a long-running distributed gRPC service on a 3-node cluster, with a benchmark client on a separate node. At first everything works. After about two days of benchmarking, one of the servers segfaulted. The node was not short of hardware resources: according to Grafana, its CPU and memory usage were both around 50%.

Steps to reproduce

Unfortunately, there is no easy way to reproduce this :(

Here is the backtrace from gdb:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  evacuate1 (p=p@entry=0x422460b788) at includes/rts/storage/ClosureMacros.h:60
60      includes/rts/storage/ClosureMacros.h: No such file or directory.
[Current thread is 1 (Thread 0x7ff7f67fc700 (LWP 157))]
(gdb) bt
#0  evacuate1 (p=p@entry=0x422460b788) at includes/rts/storage/ClosureMacros.h:60
#1  0x000000000042c16c in scavenge_block1 (bd=0x42246002c0) at rts/sm/Scav.c:600
#2  0x00000000032ac1a5 in scavenge_find_work () at rts/sm/Scav.c:2130
#3  scavenge_loop1 () at rts/sm/Scav.c:2177
#4  0x00000000032937c0 in scavenge_until_all_done () at rts/sm/GC.c:1316
#5  0x0000000003294ee0 in GarbageCollect (collect_gen=<optimized out>, collect_gen@entry=0, do_heap_census=do_heap_census@entry=false, is_overflow_gc=is_overflow_gc@entry=true, 
    deadlock_detect=deadlock_detect@entry=false, gc_type=gc_type@entry=1, cap=cap@entry=0x4291aa0, idle_cap=<optimized out>) at rts/sm/GC.c:551
#6  0x000000000327f43d in scheduleDoGC (pcap=pcap@entry=0x7ff7f67f92e8, task=task@entry=0x7ff78c4e7870, force_major=force_major@entry=false, is_overflow_gc=<optimized out>, 
    deadlock_detect=deadlock_detect@entry=false) at rts/Schedule.c:1860
#7  0x000000000328001e in schedule (initialCapability=initialCapability@entry=0x423a400, task=task@entry=0x7ff78c4e7870) at rts/Schedule.c:579
#8  0x0000000003280c71 in scheduleWorker (cap=cap@entry=0x423a400, task=task@entry=0x7ff78c4e7870) at rts/Schedule.c:2645
#9  0x0000000003287b07 in workerStart (task=0x7ff78c4e7870) at rts/Task.c:445
#10 0x00007ff80bbcc609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#11 0x00007ff80b6c4133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

However, the backtrace alone does not seem to be of much help. Are there any tools I can use to debug an RTS segfault in a long-running server? (I haven't really used rr; would it help?) I'd be happy to provide more details or carry out debugging steps if needed. Many thanks.

Expected behavior

No segfault

Environment

GHC version used: 9.2.5
Operating System: Docker with Linux host
System Architecture: x86_64
Hardware resources: 8 cores, 32 GB RAM

RTS info:

 [("GHC RTS", "YES")
 ,("GHC version", "9.2.5")
 ,("RTS way", "rts_thr")
 ,("Build platform", "x86_64-unknown-linux")
 ,("Build architecture", "x86_64")
 ,("Build OS", "linux")
 ,("Build vendor", "unknown")
 ,("Host platform", "x86_64-unknown-linux")
 ,("Host architecture", "x86_64")
 ,("Host OS", "linux")
 ,("Host vendor", "unknown")
 ,("Target platform", "x86_64-unknown-linux")
 ,("Target architecture", "x86_64")
 ,("Target OS", "linux")
 ,("Target vendor", "unknown")
 ,("Word size", "64")
 ,("Compiler unregisterised", "NO")
 ,("Tables next to code", "YES")
 ,("Flag -with-rtsopts", "-N -A64m -n4m -qg -qn1")
 ]
