Race condition in GC logic of sparks.
The reproducer from #22373 (closed) triggered a race condition when GCing sparks. The issue is as follows:
- In parallel GC, every capability prunes its own spark pool (pruneSparkQueue).
- We do this by checking whether the thing the spark points to has been evacuated. If so, we can assume the spark is alive and safely retain it. If it hasn't been evacuated, we GC the spark.
- However, sometimes there is a race where we look at a spark before the thing it points to has been evacuated, conclude it's dead, and remove it from the spark pool.
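The pruning decision described above can be sketched roughly as follows. This is a simplified model, not the RTS code: `Closure`, its `evacuated` flag, and `prune_pool` are hypothetical stand-ins, and the real pruning routine inspects heap closures rather than a boolean field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a heap closure. The real RTS looks at
 * closure headers and block flags, not a plain boolean. */
typedef struct {
    bool evacuated;
} Closure;

/* Keep sparks whose target has been evacuated and drop the rest.
 * Compacts the pool in place and returns the number of sparks kept. */
size_t prune_pool(Closure **pool, size_t n)
{
    assert(pool != NULL || n == 0);

    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (pool[i]->evacuated) {
            pool[kept++] = pool[i];   /* target is live: retain the spark */
        }
        /* otherwise the spark is treated as dead and is not copied */
    }
    return kept;
}
```

The check is only sound if `evacuated` already reflects the final state of the collection. A target that is live but has not been evacuated yet looks exactly like a dead one.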
The problem arises in gcWorkerThread: there, the call to pruneSparkQueue assumes the whole heap has been marked by the time it's called.
But since there is no explicit synchronization point before we start GCing sparks, that's a rather optimistic assumption.
While a thread will have finished marking its assigned blocks, there is no guarantee that all blocks have been marked, as other GC threads might still be busy evacuating their share of work.
I think this can only arise when we collect without work_stealing being true (usually the case for minor collections with nursery size <= 32M).
In that case, inside scavenge_until_all_done, the condition
if(is_par_gc() && work_stealing && r != 0) {
will always be false, and a thread will break out of scavenge_until_all_done as soon as it has done its work and move on to GC sparks, possibly incorrectly, as not all of the heap has been marked yet.
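To make the interleaving concrete, here is a small deterministic model of two GC workers without work stealing. Everything in it (`Gc`, `step`, the fixed work assignment, the object indices) is invented for illustration and is not taken from the RTS. The schedule is written out sequentially instead of using real threads so that the bad ordering is reproducible.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { N_WORKERS = 2, N_OBJECTS = 4, SHARE = 2 };

typedef struct {
    bool   evacuated[N_OBJECTS];  /* shared mark state of the model heap */
    size_t next[N_WORKERS];       /* next unprocessed item of each worker */
} Gc;

/* Fixed share of objects per worker; without work stealing a worker
 * never touches another worker's share. */
static const size_t work[N_WORKERS][SHARE] = { {0, 1}, {2, 3} };

/* Evacuate one object from worker w's own share.
 * Returns false once that share is exhausted. */
static bool step(Gc *gc, size_t w)
{
    assert(w < N_WORKERS);
    if (gc->next[w] >= SHARE) return false;
    gc->evacuated[work[w][gc->next[w]++]] = true;
    return true;
}

/* Worker 0 holds a spark pointing at object 3, which is live but
 * belongs to worker 1's share. */
static bool spark_survives(const Gc *gc)
{
    return gc->evacuated[3];
}

/* Worker 0 finishes its share and prunes right away, while worker 1
 * has only evacuated object 2 so far. */
bool prune_without_sync(void)
{
    Gc gc = {0};
    while (step(&gc, 0)) {}
    step(&gc, 1);
    return spark_survives(&gc);
}

/* Same setup, but pruning happens only after every worker is done. */
bool prune_after_all_done(void)
{
    Gc gc = {0};
    while (step(&gc, 0)) {}
    while (step(&gc, 1)) {}
    return spark_survives(&gc);
}
```

In this model `prune_without_sync` drops the spark even though its target is live, while `prune_after_all_done` retains it. The only difference between the two is whether the pruning worker waits for the others to finish their share.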