Introduce heap profiling by user specified roots
This is the conclusion of !1227 (merged) implementing #16788 (closed). This branch is now ready for merging. Current development is at !5071.
Example
Say you have an application with a giant state datatype, like GHC's
HscEnv
, and you'd like to know how much memory each of the datatype's
fields are using and maybe even how much they share. For example say we
want to look at the external package state and finder cache in
HscEnv
. This new profiling mode lets you do that.
All you have to do is strategically place a call to the RTS function
setRootProfPtrs
(via the Haskell wrapper setHeapRoots
, see below) such
that whenever the heap objects you want to measure change the RTS is
informed.
I'm going to use the simple program below to demonstrate how the counters behave instead of really doing it with GHC:
main = do
 get rid of thunks so the numbers are easier to interpret
evaluate $ force hsc
setHeapRoots
[ Root "hsc" hsc
, Root "eps" (hsc_EPS hsc)
, Root "fc" (hsc_FC hsc)
]
hsc = HscEnv
{ hsc_EPS = EPS x y e
, hsc_FC = FC x y z f
, hsc_OTHER = OTHER x z t
}
data HscEnv = HscEnv { hsc_EPS :: EPS, hsc_FC :: FC, hsc_OTHER :: OTHER }
data EPS = EPS { eps_A, eps_B, eps_C :: Word }
data FC = FC { fc_A, fc_B, fc_C, fc_D :: Word }
data OTHER = OTHER { o_A, o_B, o_C :: Word }
x = 1
y = 2
z = 3
e = 4
f = 5
t = 6
Notice how the various fields of hsc
share some variables but not others.
After compiling this program with profiling we can request root profiling
by roots with +RTS ho i0
(i0 makes the profiler run on every garbage
collection, this is just for demonstration). This will produce a heap
profile which gives us a complete picture of the memory usage of the three
heap objects we set using setHeapRoots
, over time.
So for example the profiler output for one sample might look like the following:
hsc 80
eps 0
fc 0
hsceps 48
hscfc 72
epsfc 0
hscepsfc 32
This would tell us that:

"hsc" is using 80 bytes of memory which are not shared with any of the other roots.
 1 word info ptr + 3 words payload of HscEnv
 1 word info ptr + 3 words payload of OTHER
 1 word info ptr + 1 word payload of Word of
t
 or (1 + 3 + 1 + 3 + 1 + 1 = 10 words) * 8 bytes = 80 bytes

"eps" and "fc" do not have any unshared memory usage since they are fully contained in, and thus reachable from, "hsc",

"hsc" and "eps" share 48 bytes, i.e. 48 bytes worth of heap objects are reachable from both the "hsc" and "eps" roots but not from "fc". Looking at the definition, only
e
is not shared with anything else hence: 1 word info ptr + 3 words payload of EPS
 1 word info ptr + 1 words payload of Word32 of
e
 or (1 + 3 + (1 + 1) = 6 words) * 8 bytes = 48 bytes. You get the idea.

"hsc" and "fc" share 72 bytes exclusively and

"hsc", "eps" and "fc" share 32 bytes exclusively.
the sum of all these "bins" 80 + 48 + 72 + 32 = 232
is the total amount
of memory reachable from the set of roots, hence it makes sense to display
them stacked in a diagram as hp2ps
does.
Note that the number of roots is limited to a smallish number, 20 in the current implementation.
Algorithm
The root profiler works by traversing the heap using the new generalised
heap traversal module TraverseHeap.c
.
Each "root" (set via setRootProfPtrs
) is given a corresponding bit in a
closure's heap profiling header to mark whether or not this closure is
reachable from that root. If there is n
roots then this gives us (2^n1)
possible combinations, each of which also gets a corresponding "bin" in an
array, sizes[]
.
We then traverse each root in turn and for each unseen closure add its size
to the sizes[]
bin corresponding to the current root.
When completing the traversal of a subtree (including leaf nodes) for the first time we fill in the "rootreachable" bitfield in the corresponding closure header just with the current root's bit.
When revisiting a node which already had its header filled we check to see
if the closure is already marked as reachable from the current root, if not
we subtract its size from the sizes[]
bin corresponding to the closure's
current "rootrechable" bitfield, add it to the new bin (with the current
root's bit also set) and traverse its children again to update their
headers and sizes entries too.
Note that the bitfield essentially corresponds to a nonempty set of the root IDs and the sizes array has this set as its index.
Due to the limited number of bits in the heap profiling header and the
exponential size of a dense sizes[]
array the number of roots is
currently limited to 20. This already makes the sizes[]
array about 8 MiB
in size. Going significantly higher will need a sparse sizes[]
array
i.e. a hashtable for example.
TODO

Add new
ho
RTS flag to user guide 
Test with a real application

Where should the Haskell facing API go?
The root profiler exposes the C function "setRootProfPtrs" to allow Haskell
programs to specify objects of interest. Currently the setHeapRoots
function from above has to be defined in used code like below. It's not
exactly super complicated but it would be nice to have that in a
userfacing library somewhere. Would it be OK to add this to ghcheap maybe
or should I add an entirely new library?
foreign import ccall unsafe "setRootProfPtrs" c_setRootProfPtrs
:: CInt > Ptr (StablePtr a) > Ptr CString > IO ()
foreign import ccall "&g_rootProfileDebugLevel" g_rootProfileDebugLevel
:: Ptr CInt
data Root = forall a. Root
{ rootDescr :: String
, rootClosure :: a
}
setHeapRoots :: [Root] > IO ()
setHeapRoots xs = do
descs < mapM (newCString . rootDescr) xs
sps < forM xs $ \(Root _ a) >
newStablePtr =<< evaluate (unsafeCoerce a :: a)
withArray descs $ \descs_arr >
withArray sps $ \sps_arr >
c_setRootProfPtrs (fromIntegral (length xs)) sps_arr descs_arr

Get rid of
unsafeCoerce
in setHeapRoots
Maybe I can access the rootClosure
field in data Root
directly from the
RTS?
 Use weak pointers for the roots
We're using StablePtrs to keep track of the roots in the RTS (see setHeapProfilingRoots) that means these objects don't get free'd when they should though. We should use WeakPtrs to fix this, but that adds additional overhead we'd like to subtract out or skip over when traversing somehow.
Edits:
 Flesh out example
 Make setHeapRoots more polymorphic by using an existential