Consider inlining dirty_MUT_VAR into Cmm or changing its preconditions.
Currently in a few places we have code like this in Cmm primops:
```
if (GET_INFO(mv) == stg_MUT_VAR_CLEAN_info) {
    ccall dirty_MUT_VAR(BaseReg "ptr", mv "ptr");
}
```
And dirty_MUT_VAR itself looks like this:
```c
void
dirty_MUT_VAR(StgRegTable *reg, StgMutVar *mvar, StgClosure *old)
{
    Capability *cap = regTableToCapability(reg);
    // No barrier required here as no other heap object fields are read. See
    // note [Heap memory barriers] in SMP.h.
    if (RELAXED_LOAD(&mvar->header.info) == &stg_MUT_VAR_CLEAN_info) {
        SET_INFO((StgClosure*) mvar, &stg_MUT_VAR_DIRTY_info);
        recordClosureMutated(cap, (StgClosure *) mvar);
        IF_NONMOVING_WRITE_BARRIER_ENABLED {
            // See Note [Dirty flags in the non-moving collector] in NonMoving.c
            updateRemembSetPushClosure_(reg, old);
        }
    }
}
```
So for GHC's primops (Prim.cmm) the comparison against stg_MUT_VAR_CLEAN_info is repeated inside the call, wasting cycles.
As an alternative we could simply require all callers of dirty_MUT_VAR to perform this check ahead of time. I think it's only called from two places:
- Prim.cmm (already does the check)
- Code generated from the writeMutVar# primop (doesn't do the check).
I hit this when experimenting with LBU profiling of GHC with -dverbose-core2core. -dverbose-core2core dumps a lot of output, so we spend a lot of time writing these dumps.
In the profile I noticed this hotspot:
```
- 35.89% base_GHCziIOziHandleziInternals_writeCharBuffer1_info Internals.hs:552 (cycles:2398)
   - 28.99% dirty_MUT_VAR@plt +0
- 16.55% _base_GHCziIOziHandleziText_zdwact1_s9w1_entry Text.hs:722 (cycles:41)
   - 15.96% dirty_MUT_VAR@plt +0
      - base_GHCziIOziHandleziInternals_writeCharBuffer1_info Internals.hs:552 (cycles:1914)
         - 14.33% dirty_MUT_VAR@plt +0
- 8.08% _base_GHCziIOziHandleziText_zdwact_r5dS_entry Text.hs:654 (cycles:1146)
     dirty_MUT_VAR@plt +0
     base_GHCziIOziHandleziInternals_writeCharBuffer1_info Internals.hs:552 (cycles:551)
     dirty_MUT_VAR@plt +0
   - _base_GHCziIOziHandleziText_zdwact1_s9w1_entry Text.hs:722 (cycles:25)
... repeats a few more times
```
There we have a recursive function that goes something like this:
```haskell
writeCharBuffer :: Handle__ -> CharBuffer -> IO ()
writeCharBuffer h_@Handle__{..} !cbuf = do
  bbuf <- readIORef haByteBuffer
  (cbuf', bbuf') <- <encode cbuf>
  ...
  -- flush the byte buffer if it is full
  if <full>
    then do
      bbuf'' <- Buffered.flushWriteBuffer haDevice bbuf'
      writeIORef haByteBuffer bbuf''
    else
      writeIORef haByteBuffer bbuf'
  if not (isEmptyBuffer cbuf')
    then writeCharBuffer h_ cbuf'
    else return ()
```
Based on the profile, this function obviously calls itself recursively quite often, each time writing an IORef which most likely is already dirty.
It seems each recursion takes ~4500 cycles, and we call dirty_MUT_VAR 4 times.
If each call costs us 10 cycles of overhead, this amounts to 40 cycles, slightly less than 1% of the cycles we spend on the recursion itself.
It's not a big win, but it would be an easy change to make, so it might be worthwhile.