rts/sm/MarkWeak.c · cfcc3c9a1f2e4e33bed4c40767f8e7971e331c15 · Glasgow Haskell Compiler / GHC

Fix global_link of TSOs for threads reachable via dead weaks · cfcc3c9a
Ömer Sinan Ağacan authored Mar 13, 2020 and
Marge Bot committed Mar 15, 2020
Fixes #17785

Here's how the problem occurs:

- In generation 0 we have a TSO that is finished (i.e. it has no more
  work to do or it is killed).

- The TSO only becomes reachable after collectDeadWeakPtrs().

- After collectDeadWeakPtrs() we switch to WeakDone phase where we don't
  move TSOs to different lists anymore (like the next gen's thread list
  or the resurrected_threads list).

- So the TSO will never be moved to a generation's thread list, but it
  will be promoted to generation 1.

- Generation 1 collected via mark-compact, and because the TSO is
  reachable it is marked, and its `global_link` field, which is bogus at
  this point (because the TSO is not in a list), will be threaded.

- Chaos ensues.

In other words, when these conditions hold:

- A TSO is reachable only after collectDeadWeakPtrs()
- It's finished (what_next is ThreadComplete or ThreadKilled)
- It's retained by mark-compact collector (moving collector doesn't
  evacuate the global_list field)

We end up doing random mutations on the heap because the TSO's
global_list field is not valid, but it still looks like a heap pointer
so we thread it during compacting GC.

The fix is simple: when we traverse old_threads lists to resurrect
unreachable threads the threads that won't be resurrected currently
stays on the old_threads lists. Those threads will never be visited
again by MarkWeak so we now reset the global_list fields. This way
compacting GC does not thread pointers to nowhere.

Testing
-------

The reproducer in #17785 is quite large and hard to build, because of
the dependencies, so I'm not adding a regression test.

In my testing the reproducer would take a less than 5 seconds to run,
and once in every ~5 runs would fail with a segfault or an assertion
error. In other cases it also fails with a test failure. Because the
tests never fail with the bug fix, assuming the code is correct, this
also means that this bug can sometimes lead to incorrect runtime
results.

After the fix I was able to run the reproducer repeatedly for about an
hour, with no runtime crashes or test failures.

To run the reproducer clone the git repo:

    $ git clone https://github.com/osa1/streamly --branch ghc-segfault

Then clone primitive and atomic-primops from their git repos and point
to the clones in cabal.project.local. The project should then be
buildable using GHC HEAD. Run the executable `properties` with `+RTS -c
-DZ`.

In addition to the reproducer above I run the test suite using:

    $ make slowtest EXTRA_HC_OPTS="-debug -with-rtsopts=-DS \
        -with-rtsopts=-c +RTS -c -RTS" SKIPWAY='nonmoving nonmoving_thr'

This enables compacting GC always in both GHC when building the test
programs and when running the test programs, and also enables sanity
checking when running the test programs. These set of flags are not
compatible for all tests so there are some failures, but I got the same
set of failures with this patch compared to GHC HEAD.
cfcc3c9a