Skip to content

rts/nonmoving: Don't scavenge objects which weren't evacuated

Ben Gamari requested to merge wip/T21885 into master

This fixes a rather subtle bug in the logic responsible for scavenging objects evacuated to the non-moving generation. In particular, objects can be allocated into the non-moving generation by two ways:

a. evacuation out of from-space by the garbage collector

b. direct allocation by the mutator

Like all evacuation, objects moved by (a) must be scavenged, since they may contain references to other objects located in from-space. To accomplish this we have the following scheme:

  • each nonmoving segment's block descriptor has a scan pointer which points to the first object which has yet to be scavenged

  • the GC tracks a set of "todo" segments which have pending scavenging work

  • to scavenge a segment, we scavenge each of the unmarked blocks between the scan pointer and segment's next_free pointer.

    We skip marked blocks since we know the allocator wouldn't have allocated into marked blocks (since they contain presumably live data).

    We can stop at next_free since, by definition, the GC could not have evacuated any objects to blocks above next_free (otherwise `next_free wouldn't be the first free block).

However, this neglected to consider objects allocated by path (b). In short, the problem is that objects directly allocated by the mutator may become unreachable (but not swept, since the containing segment is not yet full), at which point they may contain references to swept objects. Specifically, we observed this in #21885 (closed) in the following way:

  1. the mutator (specifically in #21885 (closed), a lockCAF) allocates an object (specifically a blackhole, which here we will call blkh; see Note [Static objects under the nonmoving collector] for the reason why) on the non-moving heap. The bitmap of the allocated block remains 0 (since allocation doesn't affect the bitmap) and the containing segment's (which we will call blkh_seg) next_free is advanced.
  2. We enter the blackhole, evaluating the blackhole to produce a result (specificaly a cons cell) in the nursery
  3. The blackhole gets updated into an indirection pointing to the cons cell; it is pushed to the generational remembered set
  4. we perform a GC, the cons cell is evacuated into the nonmoving heap (into segment cons_seg)
  5. the cons cell is marked
  6. the GC concludes
  7. the CAF and blackhole become unreachable
  8. cons_seg is filled
  9. we start another GC; the cons cell is swept
  10. we start a new GC
  11. something is evacuated into blkh_seg, adding it to the "todo" list
  12. we attempt to scavenge blkh_seg (namely, all unmarked blocks between scan and next_free, which includes blkh). We attempt to evacuate blkh's indirectee, which is the previously-swept cons cell. This is unsafe, since the indirectee is no longer a valid heap object.

The problem here was that the scavenging logic assumed that (a) was the only source of allocations into the non-moving heap and therefore all unmarked blocks between scan and next_free were evacuated. However, due to (b) this is not true.

The solution is to ensure that that the scanned region only encompasses the region of objects allocated during evacuation. We do this by updating scan as we push the segment to the todo-segment list to point to the block which was evacuated into.

Doing this required changing the nonmoving scavenging implementation's update of the scan pointer to bump it once, instead of after scavenging each block as was done previously. This is because we may end up evacuating into the segment being scavenged as we scavenge it. This was quite tricky to discover but the result is quite simple, demonstrating yet again that global mutable state should be used exceedingly sparingly.

Fixes #21885 (closed)

Edited by Ben Gamari

Merge request reports