`DontRetainCAFs` for updating results in segmentation faults
Summary
I have been trying to add unloading support to my dynamic software updating (DSU) system by using the `DontRetainCAFs` option for GHCi's object code linker, but this often results in crashes.
My DSU system uses GHC's API in a similar way to GHCi in order to load different versions of programs. This works well, but performing many updates causes memory usage to grow, as GHC's garbage collector will not, under normal conditions, unload Haskell object files that GHC's runtime loader has linked. This is a result of `initObjLinker` being given the `RetainCAFs` option. To add unloading support, my DSU system can now instead use the `DontRetainCAFs` option. This gives the desired behaviour, but sometimes segmentation faults and other crashes occur before the DSU system has even requested that an object file be unloaded.
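For context, the relevant linker interaction has roughly the following shape (a sketch, not the actual `lowarn-runtime` code; the object-file path and symbol name are placeholders, and it assumes the normally hidden `ghci` boot package is exposed, e.g. with `-package ghci`):

```haskell
import Foreign.Ptr (Ptr)
import GHCi.ObjLink
  ( ShouldRetainCAFs (..)
  , initObjLinker
  , loadObj
  , lookupSymbol
  , resolveObjs
  , unloadObj
  )

main :: IO ()
main = do
  -- Enabling unloading means initialising the linker with DontRetainCAFs
  -- instead of RetainCAFs; this is the only difference that enabling
  -- unloading makes in my DSU system.
  initObjLinker DontRetainCAFs

  -- Hypothetical object file and entry symbol; the real demo loads
  -- package object files discovered through the package environment.
  loadObj "Version1.o"
  ok <- resolveObjs
  if ok
    then do
      _entry <- lookupSymbol "Version1_entry_closure" :: IO (Maybe (Ptr ()))
      pure ()
    else putStrLn "linking failed"

  -- Later, once nothing should reference the old version, unload it.
  unloadObj "Version1.o"
```

The crashes I see occur before the `unloadObj` step is ever reached.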
To debug this, I have created a branch of my project's repository containing two versions of a very simple program. It also contains a demo executable that updates the first version of the program to the next version, with unloading enabled. When I run this demo with a very small nursery size, a segmentation fault reliably occurs. This does not occur when unloading is disabled. Due to the design of my DSU system, the only difference that enabling unloading makes in this situation is using `DontRetainCAFs`. In this demo, I also give every package (including external packages, via Cabal/Stack) the following flags: `-rdynamic -fwhole-archive-hs-libs -fkeep-cafs`. Most of these flags can also be found in a Cabal file in `ghc-hotswap`, described in "Hotswapping Haskell" by Jon Coens.
- `-fwhole-archive-hs-libs` and `-rdynamic` ensure that all Haskell symbols are exported by object files so that they can be used by arbitrarily loaded object files.
- `-fkeep-cafs` stops CAFs from being garbage collected. This should achieve the same effect as `HotswapMain.c` in `ghc-hotswap`. It feels like adding this option globally should mean that changing `RetainCAFs` to `DontRetainCAFs` has no effect, but this is not what happens.
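For reference, applying these flags to every package (including dependencies) can be done with something like the following `cabal.project` stanza (a sketch; the reproduction repository's actual configuration may differ):

```
-- cabal.project: apply the flags to all packages, local and external
package *
  ghc-options: -rdynamic -fwhole-archive-hs-libs -fkeep-cafs
```

With Stack, the equivalent is a `ghc-options:` entry in `stack.yaml` using the `"$everything":` target.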
Most of the relevant code is found in `lowarn-runtime`'s `Linker` module. In my demo, `load` is called twice. Enabling unloading does not change the sets of loaded or unloaded object files in this case, as the second set of object files that is loaded is a superset of the first.
I have been struggling to work out the exact cause of this issue for a while. Perhaps my understanding of CAFs is flawed and my approach to unloading is impossible, but I feel that something about it must be correct, as my method does keep the memory usage of successive updates eventually constant when it doesn't run into segmentation faults. However, I have not found any way to reliably avoid these segmentation faults.
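For anyone less familiar with the terminology, the CAFs in question are top-level bindings like the following (an illustrative example, not code from the repository):

```haskell
-- A constant applicative form (CAF): a top-level binding that is not a
-- function. Its thunk is evaluated at most once and the result is cached.
-- With RetainCAFs, the RTS keeps every CAF's cached value alive forever;
-- with DontRetainCAFs, the cached value can be garbage collected (and the
-- CAF reverted to a thunk) once no live closure references it.
largeTable :: [Integer]
largeTable = map (^ (2 :: Integer)) [1 .. 100000]

main :: IO ()
main = print (sum (take 10 largeTable))
-- prints 385
```

My expectation was that compiling everything with `-fkeep-cafs` would keep such cached values alive regardless of the linker option, which is why the `DontRetainCAFs` crashes surprise me.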
Steps to reproduce
Clone the `reproduction` branch of this repository: https://github.com/jonathanjameswatson/lowarn/tree/reproduction and use either Stack or Cabal:
Stack
stack build
stack exec reproduction-exe
Cabal
cabal build all
LOWARN_PACKAGE_ENV=$(realpath .ghc.environment*) cabal run reproduction-exe
In either case, the nursery size can be minimised by adding `+RTS -A8k -RTS` to the end of the command (and `--` before `reproduction-exe`).
Several different errors can occur, including a segmentation fault.
Expected behavior
Version 1 of the program is linked in, immediately starts, and then ends, passing the state `"a"` to the next version of the program.
The update is linked in and applied, converting the old types to the new types. Version 2, included in the update package, starts with the new state and does not crash.
This behaviour can be obtained by changing line 34 of the main file of the demo's executable to `runRuntime runtime True`, disabling unloading.
Environment
- GHC version used: 9.2.7
Optional:
- Operating System: Ubuntu 20.04.4 LTS on WSL2 for Windows 11
- System Architecture: AMD64