Skip to content
  • Ömer Sinan Ağacan's avatar
    Fix global_link of TSOs for threads reachable via dead weaks · cfcc3c9a
    Ömer Sinan Ağacan authored and Marge Bot's avatar Marge Bot committed
    Fixes #17785
    
    Here's how the problem occurs:
    
    - In generation 0 we have a TSO that is finished (i.e. it has no more
      work to do or it is killed).
    
    - The TSO only becomes reachable after collectDeadWeakPtrs().
    
    - After collectDeadWeakPtrs() we switch to WeakDone phase where we don't
      move TSOs to different lists anymore (like the next gen's thread list
      or the resurrected_threads list).
    
    - So the TSO will never be moved to a generation's thread list, but it
      will be promoted to generation 1.
    
    - Generation 1 collected via mark-compact, and because the TSO is
      reachable it is marked, and its `global_link` field, which is bogus at
      this point (because the TSO is not in a list), will be threaded.
    
    - Chaos ensues.
    
    In other words, when these conditions hold:
    
    - A TSO is reachable only after collectDeadWeakPtrs()
    - It's finished (what_next is ThreadComplete or ThreadKilled)
    - It's retained by mark-compact collector (moving collector doesn't
      evacuate the global_list field)
    
    We end up doing random mutations on the heap because the TSO's
    global_list field is not valid, but it still looks like a heap pointer
    so we thread it during compacting GC.
    
    The fix is simple: when we traverse old_threads lists to resurrect
    unreachable threads the threads that won't be resurrected currently
    stays on the old_threads lists. Those threads will never be visited
    again by MarkWeak so we now reset the global_list fields. This way
    compacting GC does not thread pointers to nowhere.
    
    Testing
    -------
    
    The reproducer in #17785 is quite large and hard to build, because of
    the dependencies, so I'm not adding a regression test.
    
    In my testing the reproducer would take a less than 5 seconds to run,
    and once in every ~5 runs would fail with a segfault or an assertion
    error. In other cases it also fails with a test failure. Because the
    tests never fail with the bug fix, assuming the code is correct, this
    also means that this bug can sometimes lead to incorrect runtime
    results.
    
    After the fix I was able to run the reproducer repeatedly for about an
    hour, with no runtime crashes or test failures.
    
    To run the reproducer clone the git repo:
    
        $ git clone https://github.com/osa1/streamly --branch ghc-segfault
    
    Then clone primitive and atomic-primops from their git repos and point
    to the clones in cabal.project.local. The project should then be
    buildable using GHC HEAD. Run the executable `properties` with `+RTS -c
    -DZ`.
    
    In addition to the reproducer above I run the test suite using:
    
        $ make slowtest EXTRA_HC_OPTS="-debug -with-rtsopts=-DS \
            -with-rtsopts=-c +RTS -c -RTS" SKIPWAY='nonmoving nonmoving_thr'
    
    This enables compacting GC always in both GHC when building the test
    programs and when running the test programs, and also enables sanity
    checking when running the test programs. These set of flags are not
    compatible for all tests so there are some failures, but I got the same
    set of failures with this patch compared to GHC HEAD.
    cfcc3c9a