Skip to content

Fix global_link of TSOs for threads reachable via dead weaks

Ömer Sinan Ağacan requested to merge osa1/ghc:fix_tso_scavenging into master

Fixes #17785 (closed)

Here's how the problem occurs:

  • In generation 0 we have a TSO that is finished (i.e. it has no more work to do or it is killed).

  • The TSO only becomes reachable after collectDeadWeakPtrs().

  • After collectDeadWeakPtrs() we switch to WeakDone phase where we don't move TSOs to different lists anymore (like the next gen's thread list or the resurrected_threads list).

  • So the TSO will never be moved to a generation's thread list, but it will be promoted to generation 1.

  • Generation 1 collected via mark-compact, and because the TSO is reachable it is marked, and its global_link field, which is bogus at this point (because the TSO is not in a list), will be threaded.

  • Chaos ensues.

In other words, when these conditions hold:

  • A TSO is reachable only after collectDeadWeakPtrs()
  • It's finished (what_next is ThreadComplete or ThreadKilled)
  • It's retained by mark-compact collector (moving collector doesn't evacuate the global_list field)

We end up doing random mutations on the heap because the TSO's global_list field is not valid, but it still looks like a heap pointer so we thread it during compacting GC.

The fix is simple: when we traverse old_threads lists to resurrect unreachable threads the threads that won't be resurrected currently stays on the old_threads lists. Those threads will never be visited again by MarkWeak so we now reset the global_list fields. This way compacting GC does not thread pointers to nowhere.

Testing

The reproducer in #17785 (closed) is quite large and hard to build, because of the dependencies, so I'm not adding a regression test.

In my testing the reproducer would take a less than 5 seconds to run, and once in every ~5 runs would fail with a segfault or an assertion error. In other cases it also fails with a test failure. Because the tests never fail with the bug fix, assuming the code is correct, this also means that this bug can sometimes lead to incorrect runtime results.

After the fix I was able to run the reproducer repeatedly for about an hour, with no runtime crashes or test failures.

To run the reproducer clone the git repo:

$ git clone https://github.com/osa1/streamly --branch ghc-segfault

Then clone primitive and atomic-primops from their git repos and point to the clones in cabal.project.local. The project should then be buildable using GHC HEAD. Run the executable properties with +RTS -c -DZ.

In addition to the reproducer above I run the test suite using:

$ make slowtest EXTRA_HC_OPTS="-debug -with-rtsopts=-DS \
    -with-rtsopts=-c +RTS -c -RTS" SKIPWAY='nonmoving nonmoving_thr'

This enables compacting GC always in both GHC when building the test programs and when running the test programs, and also enables sanity checking when running the test programs. These set of flags are not compatible for all tests so there are some failures, but I got the same set of failures with this patch compared to GHC HEAD.

Edited by Ömer Sinan Ağacan

Merge request reports