GHC 7.8–8.10: putMVar segfault
Summary
Intermittent segfaults observed in a multi-threaded program (most likely triggered via BoundedChan) built with GHC 8.10.2.
[TL;DR: A missing dirty_MVAR call in putMVar left the MVar marked clean even while it held the last reference to a TSO queue head in a younger generation. The GC subsequently moved the queue head but did not update the MVar's pointer to it, so the MVar's TSO queue was left with a dangling pointer to an unexpected object or to arbitrary memory contents.]
Steps to reproduce
The crash occurred many tens of minutes into a run that was processing tens of GB of data, so it is neither immediate nor easy to reproduce; it has happened a few times so far. It is very sensitive to scheduler timing and workload. The following made it more likely:
- Limiting the depth of the BoundedChan to 1, increasing inter-thread contention.
- Shrinking the allocation area with the RTS option -A128k, making GC more frequent.
- Running on a bare-metal 16-core/32-thread machine, to get more effective concurrency.
- A multi-layer pipeline of BoundedChans between source and sink:
  - HTTPS or stdin
  - gunzip
  - grouping into chunks of 1k lines
  - parallel JSON parser/filter
  - output
- A large dataset from an internet-wide IP survey, several GB compressed.
Expected behavior
No segfault. (The issue appears to be resolved by !4457 (closed), !4458 (closed), !4459 (merged) and !4460 (closed). It was introduced in commit 5d9e686c.)
Environment
- GHC version used: 8.10.2 (the bug affects all releases from 7.8 onward; MRs were filed for 8.8, 8.10, 9.0 and master).
Optional:
- Operating System: Fedora 31
- System Architecture: x86_64