rts/Sanity: Avoid nasty race in weak pointer sanity-checking
While investigating #18919 (closed) I ran into a nasty spurious crash due to the sanity-checker. In particular the issue arises when a worker GC thread and the main GC thread race to evacuate a weak pointer, resulting in a spurious copy of the weak pointer. This results in a weak pointer which is not added to weak_ptr_list
and therefore will never have its key evacuated. I have written a full story-line in the Note:
Note [Racing weak pointer evacuation]
While debugging a GC crash (#18919 (closed)) I noticed a spurious crash due to the end-of-GC sanity check stumbling across a weak pointer with unevacuated key. This can happen when two GC threads race to evacuate a weak pointer. Specifically, we start out with a heap with a weak pointer reachable from both a generation's weak pointer list and some other root-reachable closure (e.g. a Just constructor):
O W ┌──────────┐ ┌──────────┐ Root ────→ │ Just │ ╭───→ │ Weak# │ ←─────── weak_ptr_list Set ├──────────┤ │ ├──────────┤ │ │ ────╯ │ value │ ─→ ... └──────────┘ │ key │ ───╮ K │ ... │ │ ┌──────────┐ └──────────┘ ╰──→ │ ... │ ├──────────┤
The situation proceeds as follows:
- Thread A initiates a GC, wakes up the GC worker threads, and starts evacuating roots.
- Thread A evacuates a weak pointer object O to location O'.
- Thread A fills the block where O' lives and pushes it to its work-stealing queue.
- Thread B steals the O' block and starts scavenging it.
- Thread A enters markWeakPtrList.
- Thread A starts evacuating W, resulting in Wb'.
- Thread B scavenges O', evacuating W', resulting in Wa'.
- Thread A and B are now racing to evacuate W. Only one will win the race (due to the CAS in copy_tag). Let the winning copy be called W'.
- W will be replaced by a forwarding pointer to the winning copy, W'.
- Whichever thread loses the race will retry evacuation, see that W has already been evacuated, and proceed as usual.
- W' will get added to weak_ptr_list by markWeakPtrList.
- Eventually W' will be scavenged.
- traverseWeakPtrList will see that W' has been scavenged and evacuate the its key.
- However, the copy that lost the race is not on
weak_ptr_list
and will therefore never get itskey
field scavenged (sincetraverseWeakPtrList
will never see it).Now the heap looks like:
O' W (from-space) ┌──────────┐ ┌──────────┐ Root ────→ │ Just │ │ Fwd-ptr │ ───────────╮ Set ├──────────┤ ├──────────┤ │ │ │ ────╮ │ value │ ─→ ... │ └──────────┘ │ │ key │ ────────────────────────╮ │ │ ... │ │ │ │ └──────────┘ │ │ │ │ │ │ Wa' │ │ │ ┌──────────┐ ╭────╯ │ ╰───→ │ Weak# │ ←─────┤ │ ├──────────┤ ╰─ weak_ptr_list │ │ value │ ─→ ... │ │ key │ ───╮ K' │ │ ... │ │ ┌──────────┐ │ └──────────┘ ╰──→ │ ... │ │ ├──────────┤ │ Wb' │ ┌──────────┐ │ │ Weak# │ │ ├──────────┤ │ │ value │ ─→ ... │ │ key │ ───╮ K (from-space) │ │ ... │ │ ┌──────────┐ │ └──────────┘ ╰──→ │ 0xaaaaa │ ←──╯ ├──────────┤
Without sanity checking this is fine; we have introduced a spurious copy of W, Wb' into the heap but it is unreachable and therefore won't cause any trouble. However, with sanity checking we may encounter this spurious copy when walking the heap. Moreover, this copy was never added to weak_ptr_list, meaning that its key field will not get scavenged and will therefore point into from-space..
To avoid this checkClosure skips over the key field when it sees a weak pointer. Note that all fields of Wb' other than the key field should be valid, so we don't skip the closure entirely.
We then do additional checking of all closures on the weak_ptr_lists, where we do check
key
.