GHC 8.10.2 (Linux) RTS virtual memory leak with large array of "hot" MVars?
Allocating a large array of MVars (an anti-pattern, I guess) to ensure lossless updates of blocks of indices of a much larger Bloom filter byte array leads to unbounded VM growth (as reported by top(1), ...) as the process runs. It is eventually killed by the OOM killer. All the while, the RTS appears to believe it is only using ~4GB. Closing the input stream early to avoid memory exhaustion yields a `+RTS -s` report with a maximum residency of just over 4GB, in contrast with the OS reporting up to 30GB if the process is left running long enough (before the OOM kill). However, the "total memory in use" does match the much larger space reported by the OS:
```
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
1130135 viktor    20   0 1027.7g   5.9g  23224 S 370.4  9.4  10:31.67 filter
1130135 viktor    20   0 1027.7g   8.0g  23288 R 965.4 12.7  15:07.12 filter
1130135 viktor    20   0 1027.7g   8.1g  23288 S 101.3 12.8  17:38.93 filter
1130135 viktor    20   0 1027.7g   8.2g  23288 S 128.2 13.0  23:53.10 filter
1130135 viktor    20   0 1027.7g   8.3g  23288 S 175.4 13.2  24:40.06 filter
1130135 viktor    20   0 1027.7g   8.5g  23288 S  94.0 13.5  38:16.03 filter
```
On input stream close, `+RTS -s` reports:
```
   585,790,080,104 bytes allocated in the heap
     1,822,917,208 bytes copied during GC
     4,398,203,104 bytes maximum residency (9 sample(s))
        27,475,744 bytes maximum slop
              8696 MiB total memory in use (2 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     49320 colls, 49320 par   856.182s   28.103s     0.0006s    0.0299s
  Gen  1         9 colls,     8 par     3.934s    0.700s     0.0778s    0.3165s

  Parallel GC work balance: 41.75% (serial 0%, perfect 100%)

  TASKS: 55 (1 bound, 49 peak workers (54 total), using -N16)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.008s  (  0.005s elapsed)
  MUT     time 1480.515s  (721.787s elapsed)
  GC      time  860.116s  ( 28.803s elapsed)
  EXIT    time    0.326s  (  0.008s elapsed)
  Total   time 2340.964s  (750.602s elapsed)
```
Running with `+RTS -h` to find the root cause eliminates the symptoms: the process then runs in constant memory (sadly, Linux still reports ~1TB of virtual memory, since it counts the reserved, but not yet allocated, heap address space):
```
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
1114607 viktor    20   0 1027.7g   4.3g  22840 R 554.8  6.9 114:12.71 filter

<later, with +RTS -h>

1114607 viktor    20   0 1027.7g   4.3g  23120 S 454.8  6.9 231:32.58 filter
...
```
This time the "total memory in use" did not blow up:
```
 1,051,271,107,552 bytes allocated in the heap
   446,739,453,112 bytes copied during GC
     4,398,878,680 bytes maximum residency (8090 sample(s))
        38,862,832 bytes maximum slop
              4422 MiB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     95834 colls, 95834 par  1811.892s   68.768s     0.0007s    0.0590s
  Gen  1      8090 colls,  8089 par  5336.780s  662.034s     0.0818s    0.4665s

  Parallel GC work balance: 72.43% (serial 0%, perfect 100%)

  TASKS: 51 (1 bound, 49 peak workers (50 total), using -N16)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time     0.008s  (   0.005s elapsed)
  MUT     time  7688.006s  (1764.436s elapsed)
  GC      time  7148.672s  ( 730.802s elapsed)
  EXIT    time     0.185s  (   0.010s elapsed)
  Total   time 14836.872s  (2495.254s elapsed)

  Alloc rate    136,741,708 bytes per MUT second

  Productivity  51.8% of total user, 70.7% of total elapsed
```
Steps to reproduce
In ~15 parallel threads, a stream of a few hundred million elements was deduplicated (against prior data in hand and against each other) via the mutable Bloom filter. When no match is found, the filter is updated. Write access to the filter words was managed via ~2^20 MVar locks, so that concurrent read/modify/write cycles do not lose updates (CAS turned out, of course, to be much better/cheaper; see the sketch below).
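For concreteness, below is a minimal sketch (not the original code) of the two update schemes just described: a `withMVar`-guarded read/modify/write of a filter word, and the CAS-style atomic-OR alternative. The use of the `vector`, `primitive`, and `atomic-primops` packages, and all names and sizes, are illustrative assumptions.

```haskell
import Control.Concurrent.MVar (MVar, newMVar, withMVar)
import Control.Monad (replicateM)
import Control.Monad.Primitive (RealWorld)
import Data.Atomics (fetchOrIntArray)
import Data.Bits (finiteBitSize, shiftL, (.&.), (.|.))
import Data.Primitive.ByteArray
  ( MutableByteArray, newByteArray, readByteArray, setByteArray, writeByteArray )
import qualified Data.Vector as V

-- A sketch only: field names, sizes, and helper packages are assumptions.
data Filter = Filter
  { filterWords :: MutableByteArray RealWorld -- Bloom filter backing store, read word-wise
  , filterLocks :: V.Vector (MVar ())         -- ~2^20 locks, one per block of words
  }

wordBits :: Int
wordBits = finiteBitSize (0 :: Int)

newFilter :: Int -> Int -> IO Filter
newFilter nWords nLocks = do
  arr <- newByteArray (nWords * (wordBits `div` 8))
  setByteArray arr 0 nWords (0 :: Int)          -- zero-fill, one word at a time
  locks <- V.fromList <$> replicateM nLocks (newMVar ())
  return (Filter arr locks)

-- MVar-guarded variant: hold the lock that covers this word's block around
-- the read/modify/write, so a concurrent update of the same word is not lost.
setBitLocked :: Filter -> Int -> IO ()
setBitLocked (Filter arr locks) bit = do
  let wordIx = bit `div` wordBits
      lockIx = wordIx .&. (V.length locks - 1)  -- assumes a power-of-two lock count
      mask   = 1 `shiftL` (bit `mod` wordBits)
  withMVar (locks V.! lockIx) $ \_ -> do
    w <- readByteArray arr wordIx
    writeByteArray arr wordIx (w .|. mask :: Int)

-- CAS-style variant: one atomic fetch-OR on the word, no locks involved.
setBitAtomic :: Filter -> Int -> IO ()
setBitAtomic (Filter arr _) bit = do
  let wordIx = bit `div` wordBits
      mask   = 1 `shiftL` (bit `mod` wordBits)
  _old <- fetchOrIntArray arr wordIx mask
  return ()
```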
The working hypothesis is that rapid random
`withMVar` calls against a large array of MVars create more GC work than the GC can keep up with.
Ideally, rapid updates to a large array of boxed objects should not overwhelm the GC to the point where it is unable to keep up...
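A hedged, minimal reproducer along the lines of that hypothesis might look like the sketch below; it only isolates the MVar churn (the real program also streams, hashes, and deduplicates input), so whether it reproduces the VM growth is an open question.

```haskell
import Control.Concurrent (forkIO, getNumCapabilities, threadDelay)
import Control.Concurrent.MVar (newMVar, modifyMVar_)
import Control.Monad (forM_, forever, replicateM)
import qualified Data.Vector as V
import System.Random (randomRIO)

-- A sketch only: hammer ~2^20 MVars with rapid random updates from one
-- thread per capability, then compare RES in top(1) with the +RTS -s figures.
main :: IO ()
main = do
  let nLocks = 2 ^ (20 :: Int) :: Int
  locks <- V.fromList <$> replicateM nLocks (newMVar (0 :: Int))
  caps  <- getNumCapabilities
  forM_ [1 .. caps] $ \_ ->
    forkIO $ forever $ do
      i <- randomRIO (0, nLocks - 1)
      modifyMVar_ (locks V.! i) (pure . (+ 1))
  threadDelay maxBound   -- keep the main thread alive while the workers run
```

One would build this with `-threaded` and run it with something like `+RTS -N16 -s` to compare the OS-reported RES against the RTS figures.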
- GHC version used: 8.10.2
- Operating System: Fedora 31, Linux 5.7.15
- System Architecture: X86_64
[ !!! EDIT: Major rewrite of the original ticket... upon reflection, the source of the problem was likely the large array of constantly mutated MVars, and not the backing-store byte array for the Bloom filter. ]