`DontRetainCAFs` for updating results in segmentation faults
Summary
I have been trying to add unloading support to my dynamic software updating (DSU) system by using the `DontRetainCAFs` option for GHCi's object code linker, but this often results in crashes.
My DSU system uses GHC's API in a similar way to GHCi in order to load different versions of programs. This works well, but performing many updates causes memory usage to grow, as GHC's garbage collector will not, under normal conditions, unload Haskell object files that GHC's runtime loader has linked. This is a result of `initObjLinker` being given the `RetainCAFs` option. To add unloading support, my DSU system can now instead use the `DontRetainCAFs` option. This gives the desired behaviour, but sometimes segmentation faults and other crashes occur before the DSU system has even requested that an object file be unloaded.
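For context, the relevant linker interaction has roughly the following shape (a sketch, not the actual `lowarn-runtime` code; the object-file path and symbol name are placeholders, and it assumes the normally hidden `ghci` boot package is exposed, e.g. with `-package ghci`):

```haskell
import Foreign.Ptr (Ptr)
import GHCi.ObjLink
  ( ShouldRetainCAFs (..)
  , initObjLinker
  , loadObj
  , lookupSymbol
  , resolveObjs
  , unloadObj
  )

main :: IO ()
main = do
  -- Enabling unloading means initialising the linker with DontRetainCAFs
  -- instead of RetainCAFs; this is the only difference that enabling
  -- unloading makes in my DSU system.
  initObjLinker DontRetainCAFs

  -- Hypothetical object file and entry symbol; the real demo loads
  -- package object files discovered through the package environment.
  loadObj "Version1.o"
  ok <- resolveObjs
  if ok
    then do
      _entry <- lookupSymbol "Version1_entry_closure" :: IO (Maybe (Ptr ()))
      pure ()
    else putStrLn "linking failed"

  -- Later, once nothing should reference the old version, unload it.
  unloadObj "Version1.o"
```

The crashes I see occur before the `unloadObj` step is ever reached.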
To debug this, I have created a branch of my project's repository containing two versions of a very simple program. It also contains a demo executable that updates the first version of the program to the next version, with unloading enabled. When I run this demo with a very small nursery size, a segmentation fault reliably occurs. This does not occur when unloading is disabled. Due to the design of my DSU system, the only difference that enabling unloading makes in this situation is using `DontRetainCAFs`. In this demo, I also give every package (including external packages, via Cabal/Stack) the following flags: `-rdynamic -fwhole-archive-hs-libs -fkeep-cafs`. Most of these flags can also be found in a Cabal file in `ghc-hotswap`, described in "Hotswapping Haskell" by Jon Coens.
- `-fwhole-archive-hs-libs` and `-rdynamic` ensure that all Haskell symbols are exported by object files so that they can be used by arbitrarily loaded object files.
- `-fkeep-cafs` stops CAFs from being garbage collected. This should achieve the same effect as `HotswapMain.c` in `ghc-hotswap`. It feels like adding this option globally should mean that changing `RetainCAFs` to `DontRetainCAFs` has no effect, but this is not what happens.
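For reference, applying these flags to every package (including dependencies) can be done with something like the following `cabal.project` stanza (a sketch; the reproduction repository's actual configuration may differ):

```
-- cabal.project: apply the flags to all packages, local and external
package *
  ghc-options: -rdynamic -fwhole-archive-hs-libs -fkeep-cafs
```

With Stack, the equivalent is a `ghc-options:` entry in `stack.yaml` using the `"$everything":` target.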
Most of the relevant code is found in `lowarn-runtime`'s `Linker` module. In my demo, `load` is called twice. Enabling unloading does not change the sets of loaded or unloaded object files in this case, as the second set of object files that is loaded is a superset of the first.
I have been struggling to work out the exact cause of this issue for a while. Perhaps my understanding of CAFs is flawed and my approach to unloading is impossible, but I feel that something about it must be correct, as my method does keep the memory usage of successive updates eventually constant when it doesn't run into segmentation faults. However, I have not found any way to reliably avoid these segmentation faults.
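For anyone less familiar with the terminology, the CAFs in question are top-level bindings like the following (an illustrative example, not code from the repository):

```haskell
-- A constant applicative form (CAF): a top-level binding that is not a
-- function. Its thunk is evaluated at most once and the result is cached.
-- With RetainCAFs, the RTS keeps every CAF's cached value alive forever;
-- with DontRetainCAFs, the cached value can be garbage collected (and the
-- CAF reverted to a thunk) once no live closure references it.
largeTable :: [Integer]
largeTable = map (^ (2 :: Integer)) [1 .. 100000]

main :: IO ()
main = print (sum (take 10 largeTable))
-- prints 385
```

My expectation was that compiling everything with `-fkeep-cafs` would keep such cached values alive regardless of the linker option, which is why the `DontRetainCAFs` crashes surprise me.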
Steps to reproduce
Clone the `reproduction` branch of this repository: https://github.com/jonathanjameswatson/lowarn/tree/reproduction and use either Stack or Cabal:
Stack
stack build
stack exec reproduction-exe
Cabal
cabal build all
LOWARN_PACKAGE_ENV=$(realpath .ghc.environment*) cabal run reproduction-exe
In either case, the nursery size can be minimised by adding `+RTS -A8k -RTS` to the end of the command (and `--` before `reproduction-exe`).
Several different errors can occur, including a segmentation fault.
Expected behavior
Version 1 of the program is linked in, immediately starts, and then ends, passing the state `"a"` to the next version of the program.
The update is linked in and applied, converting the old types to the new types. Version 2, included in the update package, starts with the new state and does not crash.
This behaviour can be obtained by changing line 34 of the main file of the demo's executable to `runRuntime runtime True`, disabling unloading.
Environment
- GHC version used: 9.2.7
Optional:
- Operating System: Ubuntu 20.04.4 LTS on WSL2 for Windows 11
- System Architecture: AMD64