GHC 7.8—8.10, putMVar segfault
Intermittent segfaults observed in multi-threaded program (likely via BoundedChan) with GHC 8.10.2
[ TL;DR: A missing dirty_MVAR call in putMVar left the MVar incorrectly marked clean even when it held the last reference to a TSO queue head in a younger generation. The GC subsequently moved the queue head, but the MVar's pointer was not updated. As a result, the MVar's TSO queue was corrupted with a dangling pointer to an unexpected object or to random memory content. ]
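The failure mode in the TL;DR can be illustrated with a toy model of a generational collector. This is only a sketch: OldObj, writeField, minorGC and deref are hypothetical names invented for illustration and are not the GHC RTS API, and the "heap" is just an association list. It assumes a minor collection scans only old-generation objects that were marked dirty by a write barrier.

```haskell
-- Toy model only: all names here are illustrative, not GHC RTS code.
type Addr = Int

-- | An old-generation object (standing in for the MVar).
data OldObj = OldObj
  { queueHead :: Addr  -- pointer into the young generation
  , dirty     :: Bool  -- True if the collector will scan this object
  } deriving Show

-- | The young generation: address/value pairs.
type Heap = [(Addr, String)]

-- | Store a young pointer into the old object.
-- withBarrier = True models calling the write barrier (marking dirty).
writeField :: Bool -> Addr -> OldObj -> OldObj
writeField withBarrier target obj =
  obj { queueHead = target, dirty = dirty obj || withBarrier }

-- | Minor collection: only a dirty old object is scanned. Its target is
-- copied to a new address and the pointer is updated; everything else in
-- the young space is discarded. A clean object is left untouched.
minorGC :: Heap -> OldObj -> (Heap, OldObj)
minorGC young obj
  | dirty obj, Just v <- lookup (queueHead obj) young =
      let newAddr = queueHead obj + 1000
      in ([(newAddr, v)], obj { queueHead = newAddr })
  | otherwise = ([], obj)

-- | Follow the old object's pointer after collection.
deref :: Heap -> OldObj -> Maybe String
deref heap obj = lookup (queueHead obj) heap

main :: IO ()
main = do
  let young = [(1, "TSO")]
      clean = OldObj { queueHead = 0, dirty = False }
      run barrier = uncurry deref (minorGC young (writeField barrier 1 clean))
  print (run True)   -- Just "TSO": pointer was updated when the object moved
  print (run False)  -- Nothing: pointer still refers to the old address
```

In the model the stale pointer shows up as a harmless Nothing. In the real runtime there is no such check, so the stale address is read as if it were still a valid TSO.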
Steps to reproduce
The crash occurs many tens of minutes into a run that processes tens of GB of data, so it is by no means immediate and is difficult to reproduce. It is very sensitive to scheduler timing and workload. It has happened a few times now. Things that made it more likely were:
- Limiting the depth of the BoundedChan to 1, thus increasing inter-thread contention.
- Shrinking the allocation area with -A128k, making GC more frequent.
- Running on a bare-metal 16-core/32-thread machine, to get more effective concurrency.
- A multi-layer pipeline of BoundedChans between source and sink:
  - HTTPS or stdin
  - group into chunks of 1k lines
  - parallel JSON parser/filter
- Large dataset from internet-wide IP survey, compressed to multiple GB.
Environment
- GHC version used: GHC 8.10.2 (but the bug applies to all releases from 7.8 onward; MRs filed for 8.8, 8.10, 9.0 and master).
- Operating System: Fedora 31
- System Architecture: x86_64
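The contention-heavy setup described under Steps to reproduce can be approximated with a small stress program. This is a minimal sketch, not the program that crashed: it assumes a plain MVar is an acceptable stand-in for a BoundedChan of depth 1, uses only base, and the names stage and runPipeline, the worker count and the item count are all hypothetical. It is not expected to crash on a fixed GHC, and may well not crash on an affected one either, given how timing-sensitive the bug is.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import Control.Monad (foldM, forM_, forever, replicateM_)

-- | One pipeline stage: take from the inbox, transform, put to the outbox.
-- Each MVar holds at most one item, so readers and writers block constantly.
stage :: (a -> b) -> MVar a -> MVar b -> IO ()
stage f inbox outbox = forever $ do
  x <- takeMVar inbox
  putMVar outbox $! f x

-- | Push the numbers 1..n through two layers of workers and sum the results.
runPipeline :: Int -> Int -> IO Int
runPipeline workers n = do
  source <- newEmptyMVar
  middle <- newEmptyMVar
  sink   <- newEmptyMVar
  -- Several workers per layer compete for the same MVars.
  replicateM_ workers (forkIO (stage (+ 1) source middle))
  replicateM_ workers (forkIO (stage (* 2) middle sink))
  -- Producer thread feeds the first layer.
  _ <- forkIO (forM_ [1 .. n] (putMVar source))
  -- The calling thread is the sink; it collects exactly n results.
  foldM (\acc _ -> do { y <- takeMVar sink; pure $! acc + y }) 0 [1 .. n]

main :: IO ()
main = do
  total <- runPipeline 8 1000000
  print total
```

To approximate the conditions listed above, build with ghc -threaded -rtsopts and run with +RTS -N -A128k so that many capabilities are used and minor collections are frequent.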