Daniel Gröber (dxld) requested to merge DanielG/ghc:rts-root-profiler into master Jul 18, 2019

This is a continuation of !1227 (merged) implementing #16788 (closed). This branch is now ready for merging. There is one more PR pending at !5071 finishing up the feature.

Example

Say you have an application with a giant state datatype, like GHC's HscEnv, and you'd like to know how much memory each of the datatype's fields are using and maybe even how much they share. For example say we want to look at the external package state and finder cache in HscEnv. This new profiling mode lets you do that.

All you have to do is strategically place a call to the RTS function setRootProfPtrs (via the Haskell wrapper setHeapRoots, see below) such that whenever the heap objects you want to measure change the RTS is informed.

I'm going to use the simple program below to demonstrate how the counters behave instead of really doing it with GHC:

main = do
  -- get rid of thunks so the numbers are easier to interpret
  evaluate $ force hsc
  setHeapRoots
    [ Root "hsc" hsc
    , Root "eps" (hsc_EPS hsc)
    , Root "fc"  (hsc_FC hsc)
    ]

hsc = HscEnv
  { hsc_EPS   = EPS   x y   e
  , hsc_FC    = FC    x y z f
  , hsc_OTHER = OTHER x   z t
  }

data HscEnv = HscEnv { hsc_EPS :: EPS, hsc_FC :: FC, hsc_OTHER :: OTHER }
data EPS    = EPS    { eps_A, eps_B, eps_C :: Word }
data FC     = FC     { fc_A, fc_B, fc_C, fc_D :: Word }
data OTHER  = OTHER  { o_A, o_B, o_C :: Word }

x = 1
y = 2
z = 3
e = 4
f = 5
t = 6

Notice how the various fields of hsc share some variables but not others.

After compiling this program with profiling we can request root profiling by roots with +RTS -ho -i0 (-i0 makes the profiler run on every garbage collection, this is just for demonstration). This will produce a heap profile which gives us a complete picture of the memory usage of the three heap objects we set using setHeapRoots, over time.

So for example the profiler output for one sample might look like the following:

hsc         80
eps         0
fc          0
hsc-eps     48
hsc-fc      72
eps-fc      0
hsc-eps-fc  32

This would tell us that:

"hsc" is using 80 bytes of memory which are not shared with any of the other roots.
- 1 word info ptr + 3 words payload of HscEnv
- 1 word info ptr + 3 words payload of OTHER
- 1 word info ptr + 1 word payload of Word of t
- or (1 + 3 + 1 + 3 + 1 + 1 = 10 words) * 8 bytes = 80 bytes
"eps" and "fc" do not have any unshared memory usage since they are fully contained in, and thus reachable from, "hsc",
"hsc" and "eps" share 48 bytes, i.e. 48 bytes worth of heap objects are reachable from both the "hsc" and "eps" roots but not from "fc". Looking at the definition, only e is not shared with anything else hence:
- 1 word info ptr + 3 words payload of EPS
- 1 word info ptr + 1 words payload of Word32 of e
- or (1 + 3 + (1 + 1) = 6 words) * 8 bytes = 48 bytes. You get the idea.
"hsc" and "fc" share 72 bytes exclusively and
"hsc", "eps" and "fc" share 32 bytes exclusively.

the sum of all these "bins" 80 + 48 + 72 + 32 = 232 is the total amount of memory reachable from the set of roots, hence it makes sense to display them stacked in a diagram as hp2ps does.

Note that the number of roots is limited to a smallish number, 20 in the current implementation.

Algorithm

The root profiler works by traversing the heap using the new generalised heap traversal module TraverseHeap.c.

Each "root" (set via setRootProfPtrs) is given a corresponding bit in a closure's heap profiling header to mark whether or not this closure is reachable from that root. If there is n roots then this gives us (2^n-1) possible combinations, each of which also gets a corresponding "bin" in an array, sizes[].

We then traverse each root in turn and for each unseen closure add its size to the sizes[] bin corresponding to the current root.

When completing the traversal of a subtree (including leaf nodes) for the first time we fill in the "root-reachable" bitfield in the corresponding closure header just with the current root's bit.

When revisiting a node which already had its header filled we check to see if the closure is already marked as reachable from the current root, if not we subtract its size from the sizes[] bin corresponding to the closure's current "root-rechable" bitfield, add it to the new bin (with the current root's bit also set) and traverse its children again to update their headers and sizes entries too.

Note that the bitfield essentially corresponds to a non-empty set of the root IDs and the sizes array has this set as its index.

Due to the limited number of bits in the heap profiling header and the exponential size of a dense sizes[] array the number of roots is currently limited to 20. This already makes the sizes[] array about 8 MiB in size. Going significantly higher will need a sparse sizes[] array i.e. a hashtable for example.

TODO

Add new -ho RTS flag to user guide
Test with a real application
Where should the Haskell facing API go?

The root profiler exposes the C function "setRootProfPtrs" to allow Haskell programs to specify objects of interest. Currently the setHeapRoots function from above has to be defined in used code like below. It's not exactly super complicated but it would be nice to have that in a user-facing library somewhere. Would it be OK to add this to ghc-heap maybe or should I add an entirely new library?

foreign import ccall unsafe "setRootProfPtrs" c_setRootProfPtrs
  :: CInt -> Ptr (StablePtr a) -> Ptr CString -> IO ()

foreign import ccall "&g_rootProfileDebugLevel" g_rootProfileDebugLevel
  :: Ptr CInt

data Root = forall a. Root
  { rootDescr   :: String
  , rootClosure :: a
  }

setHeapRoots :: [Root] -> IO ()
setHeapRoots xs = do
  descs <- mapM (newCString . rootDescr) xs
  sps   <- forM xs $ \(Root _ a) ->
    newStablePtr =<< evaluate (unsafeCoerce a :: a)
  withArray descs $ \descs_arr ->
    withArray sps $ \sps_arr ->
      c_setRootProfPtrs (fromIntegral (length xs)) sps_arr descs_arr

Get rid of unsafeCoerce in setHeapRoots

Maybe I can access the rootClosure field in data Root directly from the RTS?

Use weak pointers for the roots

We're using StablePtrs to keep track of the roots in the RTS (see setHeapProfilingRoots) that means these objects don't get free'd when they should though. We should use WeakPtrs to fix this, but that adds additional overhead we'd like to subtract out or skip over when traversing somehow.

Edits:

Flesh out example
Make setHeapRoots more polymorphic by using an existential

Edited Sep 20, 2021 by Daniel Gröber (dxld)

Introduce heap profiling by user specified roots

Example

Algorithm

TODO

Merge request reports