Heap profiling and filtering by closure biography can deadlock the RTS
When heap profiling, if a closure biography filter is specified (e.g. `-hc -hbdrag,void`), it is possible to non-deterministically trigger a deadlock in the RTS. The issue ultimately stems from an attempt to acquire the storage manager (SM) lock while it is already held.
I have not yet managed to create a self-contained example that can reliably reproduce this, as several unlikely events must happen in succession to provoke the bug. Here is the sequence of steps that leads to the deadlock:

- A garbage collection is triggered. `GarbageCollect` immediately acquires the SM lock.
- At some point during the collection, a `THUNK_SELECTOR` closure must be evacuated. The GC attempts to evaluate the closure via `eval_thunk_selector` in `rts/sm/Evac.c`.
- The selectee is, in fact, an evaluated constructor, so the GC can replace the selector thunk with the field it is selecting.
- When biographical profiling is enabled, the closure that was just evaluated must be recorded as dead. To do this, `eval_thunk_selector` briefly restores the closure’s original info pointer (since it’s currently a locked `WHITEHOLE`), then calls `OVERWRITING_CLOSURE`, which calls `LDV_recordDead`.
- `LDV_recordDead` looks up the counter associated with the given closure in `census[era].hash`, i.e. the census for the era corresponding to the collection currently taking place. If the counter already exists, it simply adjusts it and returns. However, if it doesn’t, it must allocate memory to hold it, which it does by calling `arenaAlloc`.
- `arenaAlloc` attempts to allocate the requested memory in the current block. If it fits, it increments the free pointer and returns. However, if it doesn’t, it must allocate a new block, which it does by calling `allocGroup_lock`.
- `allocGroup_lock` begins by acquiring the SM lock. Since the lock is already held, the program deadlocks.
I’m not sure what the right fix is here. Notably, when `GarbageCollect` calls `LdvCensusForDead` after evacuating the heap, it explicitly releases the SM lock for the duration of the call so that `LdvCensusForDead` can take it. I suspect that doing the same thing in the middle of evacuation is not a very good idea.
One possibility would be to somehow provide an alternate code path to `LDV_recordDead` for the GC to use that suppresses the attempt to reacquire the SM lock. This would require parameterizing `arenaAlloc` to instruct it to call `allocGroup` rather than `allocGroup_lock`.
Another possibility would be to defer recording closures evaluated by `eval_thunk_selector` as dead until after evacuation completes. Intuitively, this makes sense, as this is what `LdvCensusForDead` is supposed to do. In fact, I don’t understand why `LdvCensusForDead` isn’t already enough to handle this case without any additional intervention by the GC.
**Is this even the right thing?**
As a final point, I’m not sure that calling `LDV_recordDead` on these closures is even the right thing to do. `LDV_recordDead` assumes that a closure’s death means the value it represents was no longer necessary, so the time since its last use is considered drag (or void if it was never used). However, destruction of `THUNK_SELECTOR` closures via `eval_thunk_selector` does not mean they were “discarded as no longer useful”: the value they’ve been overwritten with may still be used in the future, and their eager replacement is essentially an optimization. This optimization is somewhat incompatible with the way the biographical profiler sees the world:
- If the optimization weren’t performed, the `THUNK_SELECTOR` closure might eventually be forced in the usual way, which would count as a use. This would make it incorrect to consider the closure in either the drag or void states.
- On the other hand, `eval_thunk_selector` can’t itself be considered a use, as it’s entirely possible that the closure wouldn’t have been used anyway. Therefore, considering the closure in the use state might be incorrect, too.
This suggests that either this optimization ought to be disabled when LDV profiling is enabled (which could of course be misleading in an entirely different way), or these closures ought to be accounted for in a different way, considering them in an “unknown due to optimizations” state.