CAF reversion is generally broken

While looking at #22417 (closed), @AndreasK and I realized that CAF reversion is at best fragile and has been for quite some time.

In short, the debug RTS assumes that it can clobber CAFs' saved_info field in newCAF. A nearby comment offers this explanation for why this reuse is safe:

The saved_info field of the CAF is used as the link field for debug_caf_list, because this field is only used by newDynCAF for revertible CAFs, and we don't put those on the debug_caf_list.

Note that the newDynCAF mentioned above no longer exists; it now seems to be called newRetainedCAF. This overloading of saved_info is fine when static linking is used: in that case newRetainedCAF is used in place of newCAF when a CAF from dynamically-loaded code is entered. newRetainedCAF then adds the CAF to revertible_caf_list, and lockCAF preserves the CAF's original info table in its saved_info field. CAFs on revertible_caf_list are treated as roots by the GC. When revertCAFs is called, we walk revertible_caf_list and reset the info pointer of each CAF to caf->saved_info.
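The static-linking path can be modelled in a few lines of C. This is a toy sketch rather than the RTS code: the ToyCAF struct, the integer info tags, and the toy_ functions are all hypothetical stand-ins, and only the field and list names (saved_info, static_link, revertible_caf_list) mirror those mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for info tables (hypothetical): a CAF is either an
 * unevaluated thunk or an indirection to its evaluated value. */
enum ToyInfo { TOY_CAF_THUNK, TOY_IND_STATIC };

typedef struct ToyCAF {
    enum ToyInfo   info;        /* current info pointer, modelled as a tag */
    enum ToyInfo   saved_info;  /* original info, kept for reversion */
    struct ToyCAF *static_link; /* link field used to chain CAFs */
} ToyCAF;

static ToyCAF *revertible_caf_list = NULL;

/* Models lockCAF: remember the original info, then overwrite it. */
static void toy_lock_caf(ToyCAF *caf) {
    caf->saved_info = caf->info;
    caf->info = TOY_IND_STATIC;
}

/* Models newRetainedCAF: lock the CAF and push it onto the
 * revertible list. */
static void toy_new_retained_caf(ToyCAF *caf) {
    assert(caf->info == TOY_CAF_THUNK); /* only unevaluated CAFs are entered */
    toy_lock_caf(caf);
    caf->static_link = revertible_caf_list;
    revertible_caf_list = caf;
}

/* Models revertCAFs: restore every listed CAF to its saved info
 * and empty the list. */
static void toy_revert_cafs(void) {
    ToyCAF *caf = revertible_caf_list;
    while (caf != NULL) {
        ToyCAF *next = caf->static_link;
        caf->info = caf->saved_info;
        caf->static_link = NULL;
        caf = next;
    }
    revertible_caf_list = NULL;
}
```

In this model, reversion works only because saved_info still holds the original info when toy_revert_cafs runs. Reusing that field as a link for another list would break it.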

However, when dynamic linking is used (as it is on most platforms at this point), we cannot redirect calls to newCAF to newRetainedCAF in loaded code, since we have no control over the dynamic loader's symbol resolution behavior. Consequently, we have a hack in newCAF to make it behave somewhat like newRetainedCAF. In this path, CAFs are added to another list, dyn_caf_list (which is chained through static_link), which revertCAFs does not touch. This keepCAFs codepath neglects to add the CAFs to the debug_caf_list, which is perhaps okay since we will never collect these CAFs anyway.
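A toy model of the dynamic path shows why reversion silently does nothing there. As before, this is a simplified sketch under stated assumptions and not the RTS implementation: ToyCAF, the info tags, and the toy_ functions are hypothetical, and the real newCAF does considerably more than shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum ToyInfo { TOY_CAF_THUNK, TOY_IND_STATIC };

typedef struct ToyCAF {
    enum ToyInfo   info;
    enum ToyInfo   saved_info;
    struct ToyCAF *static_link;
} ToyCAF;

static bool    keepCAFs = false;            /* models the keepCAFs setting */
static ToyCAF *dyn_caf_list = NULL;         /* retained, never reverted */
static ToyCAF *revertible_caf_list = NULL;  /* walked by reversion */

/* Models the keepCAFs hack in newCAF: the CAF is retained by chaining
 * it onto dyn_caf_list through static_link, but it never reaches
 * revertible_caf_list. */
static void toy_new_caf(ToyCAF *caf) {
    assert(caf->info == TOY_CAF_THUNK);
    caf->saved_info = caf->info;
    caf->info = TOY_IND_STATIC;
    if (keepCAFs) {
        caf->static_link = dyn_caf_list;
        dyn_caf_list = caf;
    }
    /* ordinary (non-retained) CAF bookkeeping omitted */
}

/* Models revertCAFs: only revertible_caf_list is walked, so CAFs on
 * dyn_caf_list keep their evaluated state. */
static void toy_revert_cafs(void) {
    ToyCAF *caf = revertible_caf_list;
    while (caf != NULL) {
        ToyCAF *next = caf->static_link;
        caf->info = caf->saved_info;
        caf->static_link = NULL;
        caf = next;
    }
    revertible_caf_list = NULL;
}
```

With keepCAFs set, a CAF entered through toy_new_caf is still evaluated after toy_revert_cafs returns, which matches the behavior described above.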

For this reason CAF reversion can't be expected to work in the dynamic way. This explains some of the strange behavior I have observed in GHCi and while debugging CAF issues in the past.

Because breaking CAF reversion is quite bad for some uses, we at some point merged a patch from a third party introducing the --high-mem-dynamic RTS flag. This allows us to recover the ability to revert CAFs in dynamic objects by way of a rather fragile address check. This approach seems entirely untenable given the growing ubiquity of ASLR.

A better solution?

All of the above seems like a rather horrible state of affairs. The implementation is subtle, the semantics are poorly defined and context dependent, and in some cases we are forced to rely on fragile assumptions about address-space layout to try to recover sensible semantics.

However, the problem we are trying to solve is very simple: we need a way to distinguish CAFs defined in dynamically-loaded code (e.g. code loaded in GHCi) from those defined in the executable image. I suggest the following straightforward mechanism for addressing this problem:

the RTS defines a global revertible variable
the code generator forms a linked list of all CAFs defined in a compilation unit via the initial values of their static_link fields
each compilation unit carries a static constructor which calls a registration function in the RTS
when called, the registration function will link all CAFs onto revertible_caf_list if revertible is set; otherwise it will do nothing
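
The four steps above might look roughly like the following. This is only a sketch of the proposal under assumed names: toy_register_unit_cafs, toy_unit_constructor, and the ToyCAF layout are hypothetical, and in a real build the per-unit function would be emitted as a static constructor rather than called by hand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyCAF {
    const char    *name;        /* illustrative label only */
    struct ToyCAF *static_link; /* initial value set by the code generator */
} ToyCAF;

/* RTS side: the global flag and the list that reversion would walk. */
static bool    revertible = false;
static ToyCAF *revertible_caf_list = NULL;

/* Hypothetical registration function. If revertible is set, splice the
 * unit's pre-linked CAF chain onto revertible_caf_list and return the
 * number of CAFs linked; otherwise do nothing and return 0. */
static size_t toy_register_unit_cafs(ToyCAF *unit_head) {
    if (!revertible || unit_head == NULL) {
        return 0;
    }
    size_t  count = 1;
    ToyCAF *last  = unit_head;
    while (last->static_link != NULL) {
        last = last->static_link;
        count++;
    }
    last->static_link   = revertible_caf_list;
    revertible_caf_list = unit_head;
    return count;
}

/* Compilation-unit side: the code generator emits the unit's CAFs
 * already chained together through static_link. */
static ToyCAF unit_caf_b = { "b", NULL };
static ToyCAF unit_caf_a = { "a", &unit_caf_b };

/* Stand-in for the per-unit static constructor. */
static size_t toy_unit_constructor(void) {
    return toy_register_unit_cafs(&unit_caf_a);
}
```

The per-unit cost in this sketch is one walk of the unit's CAF chain at load time, and only when revertible is set. A unit loaded with the flag clear pays for a single branch.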

Admittedly, I am a tad worried about the runtime overhead of this. However, we do similar things for IPE information and cost-centre registration so perhaps it is acceptable.

Relevant notes

Note [GHCi CAFs] in rts/sm/Storage.c
Note [STATIC_LINK fields] in rts/sm/Storage.h
Note [dyn_caf_list] in rts/sm/Storage.c

Edited Dec 20, 2022 by Ben Gamari
