GHC 8.10.2 (Linux) RTS virtual memory leak with large array of "hot" MVars?
Allocating a large array of MVars (an anti-pattern, I guess) to ensure lossless updates of blocks of indices of a much larger Bloom filter byte array leads to unbounded VM growth (as reported by top(1), ...) as the process runs. It is eventually killed by the OOM killer. All the while, the RTS appears to believe it is only using ~4GB. Closing the input stream early to avoid memory exhaustion yields a `+RTS -s` report with a maximum residency of just over 4GB, in contrast with the OS reporting up to 30GB if the process is left running long enough (before the OOM kill). However, the "total memory in use" does match the much larger space reported by the OS:
```
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
1130135 viktor    20   0 1027.7g   5.9g  23224 S 370.4  9.4  10:31.67 filter
1130135 viktor    20   0 1027.7g   8.0g  23288 R 965.4 12.7  15:07.12 filter
1130135 viktor    20   0 1027.7g   8.1g  23288 S 101.3 12.8  17:38.93 filter
1130135 viktor    20   0 1027.7g   8.2g  23288 S 128.2 13.0  23:53.10 filter
1130135 viktor    20   0 1027.7g   8.3g  23288 S 175.4 13.2  24:40.06 filter
1130135 viktor    20   0 1027.7g   8.5g  23288 S  94.0 13.5  38:16.03 filter
```
On input stream close, `+RTS -s` reports:
```
   585,790,080,104 bytes allocated in the heap
     1,822,917,208 bytes copied during GC
     4,398,203,104 bytes maximum residency (9 sample(s))
        27,475,744 bytes maximum slop
              8696 MiB total memory in use (2 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     49320 colls, 49320 par   856.182s   28.103s     0.0006s    0.0299s
  Gen  1         9 colls,     8 par     3.934s    0.700s     0.0778s    0.3165s

  Parallel GC work balance: 41.75% (serial 0%, perfect 100%)

  TASKS: 55 (1 bound, 49 peak workers (54 total), using -N16)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.008s  (  0.005s elapsed)
  MUT     time 1480.515s  (721.787s elapsed)
  GC      time  860.116s  ( 28.803s elapsed)
  EXIT    time    0.326s  (  0.008s elapsed)
  Total   time 2340.964s  (750.602s elapsed)
```
Running with `+RTS -h` to find the root cause eliminates the symptoms: the process then runs in constant memory (sadly, Linux still reports ~1TB of virtual memory, since it counts the reserved, but not yet allocated, heap address space):
```
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
1114607 viktor    20   0 1027.7g   4.3g  22840 R 554.8  6.9 114:12.71 filter

<later, with +RTS -h>

1114607 viktor    20   0 1027.7g   4.3g  23120 S 454.8  6.9 231:32.58 filter
...
```
This time the "total memory in use" did not blow up:
```
 1,051,271,107,552 bytes allocated in the heap
   446,739,453,112 bytes copied during GC
     4,398,878,680 bytes maximum residency (8090 sample(s))
        38,862,832 bytes maximum slop
              4422 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     95834 colls, 95834 par  1811.892s   68.768s     0.0007s    0.0590s
  Gen  1      8090 colls,  8089 par  5336.780s  662.034s     0.0818s    0.4665s

  Parallel GC work balance: 72.43% (serial 0%, perfect 100%)

  TASKS: 51 (1 bound, 49 peak workers (50 total), using -N16)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time     0.008s  (   0.005s elapsed)
  MUT     time  7688.006s  (1764.436s elapsed)
  GC      time  7148.672s  ( 730.802s elapsed)
  EXIT    time     0.185s  (   0.010s elapsed)
  Total   time 14836.872s  (2495.254s elapsed)

  Alloc rate    136,741,708 bytes per MUT second

  Productivity  51.8% of total user, 70.7% of total elapsed
```
Steps to reproduce
In ~15 parallel threads, a stream of a few hundred million elements was deduplicated (against prior data in hand and against each other) via the mutable Bloom filter. When no match is found, the filter is updated. Write access to the filter words was managed via ~2^20 MVar locks, so that concurrent read/modify/write cycles do not lose updates (CAS turned out, of course, to be much better/cheaper; see the sketch below).
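For concreteness, below is a minimal sketch (not the original code) of the two update schemes just described: a `withMVar`-guarded read/modify/write of a filter word, and the CAS-style atomic-OR alternative. The use of the `vector`, `primitive`, and `atomic-primops` packages, and all names and sizes, are illustrative assumptions.

```haskell
import Control.Concurrent.MVar (MVar, newMVar, withMVar)
import Control.Monad (replicateM)
import Control.Monad.Primitive (RealWorld)
import Data.Atomics (fetchOrIntArray)
import Data.Bits (finiteBitSize, shiftL, (.&.), (.|.))
import Data.Primitive.ByteArray
  ( MutableByteArray, newByteArray, readByteArray, setByteArray, writeByteArray )
import qualified Data.Vector as V

-- A sketch only: field names, sizes, and helper packages are assumptions.
data Filter = Filter
  { filterWords :: MutableByteArray RealWorld -- Bloom filter backing store, read word-wise
  , filterLocks :: V.Vector (MVar ())         -- ~2^20 locks, one per block of words
  }

wordBits :: Int
wordBits = finiteBitSize (0 :: Int)

newFilter :: Int -> Int -> IO Filter
newFilter nWords nLocks = do
  arr <- newByteArray (nWords * (wordBits `div` 8))
  setByteArray arr 0 nWords (0 :: Int)          -- zero-fill, one word at a time
  locks <- V.fromList <$> replicateM nLocks (newMVar ())
  return (Filter arr locks)

-- MVar-guarded variant: hold the lock that covers this word's block around
-- the read/modify/write, so a concurrent update of the same word is not lost.
setBitLocked :: Filter -> Int -> IO ()
setBitLocked (Filter arr locks) bit = do
  let wordIx = bit `div` wordBits
      lockIx = wordIx .&. (V.length locks - 1)  -- assumes a power-of-two lock count
      mask   = 1 `shiftL` (bit `mod` wordBits)
  withMVar (locks V.! lockIx) $ \_ -> do
    w <- readByteArray arr wordIx
    writeByteArray arr wordIx (w .|. mask :: Int)

-- CAS-style variant: one atomic fetch-OR on the word, no locks involved.
setBitAtomic :: Filter -> Int -> IO ()
setBitAtomic (Filter arr _) bit = do
  let wordIx = bit `div` wordBits
      mask   = 1 `shiftL` (bit `mod` wordBits)
  _old <- fetchOrIntArray arr wordIx mask
  return ()
```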
The working hypothesis is that rapid random
`withMVar` calls against a large array of MVars create more GC work than the GC can keep up with.
Ideally, rapid updates to a large array of boxed objects should not overwhelm the GC to the point where it is unable to keep up...
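A hedged, minimal reproducer along the lines of that hypothesis might look like the sketch below; it only isolates the MVar churn (the real program also streams, hashes, and deduplicates input), so whether it reproduces the VM growth is an open question.

```haskell
import Control.Concurrent (forkIO, getNumCapabilities, threadDelay)
import Control.Concurrent.MVar (newMVar, modifyMVar_)
import Control.Monad (forM_, forever, replicateM)
import qualified Data.Vector as V
import System.Random (randomRIO)

-- A sketch only: hammer ~2^20 MVars with rapid random updates from one
-- thread per capability, then compare RES in top(1) with the +RTS -s figures.
main :: IO ()
main = do
  let nLocks = 2 ^ (20 :: Int) :: Int
  locks <- V.fromList <$> replicateM nLocks (newMVar (0 :: Int))
  caps  <- getNumCapabilities
  forM_ [1 .. caps] $ \_ ->
    forkIO $ forever $ do
      i <- randomRIO (0, nLocks - 1)
      modifyMVar_ (locks V.! i) (pure . (+ 1))
  threadDelay maxBound   -- keep the main thread alive while the workers run
```

One would build this with `-threaded` and run it with something like `+RTS -N16 -s` to compare the OS-reported RES against the RTS figures.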
- GHC version used: 8.10.2
- Operating System: Fedora 31, Linux 5.7.15
- System Architecture: X86_64
[ !!! EDIT: Major rewrite of the original ticket... upon reflection, the source of the problem was likely the large array of constantly mutated MVars, and not the backing-store byte array for the Bloom filter. ]