Segfault after running the program for several days
Summary
We run a long-running distributed gRPC service on a 3-node cluster, with a benchmark client on a separate node. At first everything works fine, but after two days of benchmarking one of the servers segfaulted. I'm confident the node had not run out of hardware resources: according to Grafana, its CPU and memory usage were both at about 50%.
Steps to reproduce
Unfortunately, there is no easy way to reproduce :(
Here is the backtrace from gdb:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 evacuate1 (p=p@entry=0x422460b788) at includes/rts/storage/ClosureMacros.h:60
60 includes/rts/storage/ClosureMacros.h: No such file or directory.
[Current thread is 1 (Thread 0x7ff7f67fc700 (LWP 157))]
(gdb) bt
#0 evacuate1 (p=p@entry=0x422460b788) at includes/rts/storage/ClosureMacros.h:60
#1 0x000000000042c16c in scavenge_block1 (bd=0x42246002c0) at rts/sm/Scav.c:600
#2 0x00000000032ac1a5 in scavenge_find_work () at rts/sm/Scav.c:2130
#3 scavenge_loop1 () at rts/sm/Scav.c:2177
#4 0x00000000032937c0 in scavenge_until_all_done () at rts/sm/GC.c:1316
#5 0x0000000003294ee0 in GarbageCollect (collect_gen=<optimized out>, collect_gen@entry=0, do_heap_census=do_heap_census@entry=false, is_overflow_gc=is_overflow_gc@entry=true, deadlock_detect=deadlock_detect@entry=false, gc_type=gc_type@entry=1, cap=cap@entry=0x4291aa0, idle_cap=<optimized out>) at rts/sm/GC.c:551
#6 0x000000000327f43d in scheduleDoGC (pcap=pcap@entry=0x7ff7f67f92e8, task=task@entry=0x7ff78c4e7870, force_major=force_major@entry=false, is_overflow_gc=<optimized out>, deadlock_detect=deadlock_detect@entry=false) at rts/Schedule.c:1860
#7 0x000000000328001e in schedule (initialCapability=initialCapability@entry=0x423a400, task=task@entry=0x7ff78c4e7870) at rts/Schedule.c:579
#8 0x0000000003280c71 in scheduleWorker (cap=cap@entry=0x423a400, task=task@entry=0x7ff78c4e7870) at rts/Schedule.c:2645
#9 0x0000000003287b07 in workerStart (task=0x7ff78c4e7870) at rts/Task.c:445
#10 0x00007ff80bbcc609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#11 0x00007ff80b6c4133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
However, this does not seem to be of much help. Are there any tools I can use to debug an RTS segfault in a long-running server? (I haven't really used rr; would it help?) I'd be happy to provide more details or run some debugging steps if anyone needs them. Many thanks.
Expected behavior
No segfault
Environment
- GHC version used: 9.2.5
- Operating System: Docker with Linux host
- System Architecture: x86_64
- Hardware resources: 8 cores, 32 GB RAM
RTS info:
[("GHC RTS", "YES")
,("GHC version", "9.2.5")
,("RTS way", "rts_thr")
,("Build platform", "x86_64-unknown-linux")
,("Build architecture", "x86_64")
,("Build OS", "linux")
,("Build vendor", "unknown")
,("Host platform", "x86_64-unknown-linux")
,("Host architecture", "x86_64")
,("Host OS", "linux")
,("Host vendor", "unknown")
,("Target platform", "x86_64-unknown-linux")
,("Target architecture", "x86_64")
,("Target OS", "linux")
,("Target vendor", "unknown")
,("Word size", "64")
,("Compiler unregisterised", "NO")
,("Tables next to code", "YES")
,("Flag -with-rtsopts", "-N -A64m -n4m -qg -qn1")
]
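To double-check that the `-with-rtsopts` settings above are what the running binary actually uses, the flags can also be read back from inside the process. This is only a minimal sketch, not part of the service: it assumes the `GHC.RTS.Flags` module shipped with `base` for GHC 9.2 (newer GHC releases may have moved or deprecated it), and it prints the raw records rather than interpreting them.

```haskell
-- Minimal sketch: dump the GC and parallelism flags the RTS is running with.
-- Assumes GHC.RTS.Flags from the base library bundled with GHC 9.2.
import GHC.RTS.Flags (getGCFlags, getParFlags)

main :: IO ()
main = do
  -- GC-related settings (allocation area, nursery chunk size, generations, ...)
  gcFlags <- getGCFlags
  print gcFlags
  -- Parallelism settings (capabilities, parallel GC on/off, GC threads, ...)
  parFlags <- getParFlags
  print parFlags
```

Building this with the same `-threaded` and `-with-rtsopts` options as the server and running it should show, for example, whether `parGcEnabled` in the `ParFlags` record reflects the `-qg` setting.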