Nursery size adulterates cachegrind metrics in nofib
I'm currently investigating an alleged regression in my branch of the late lambda lift and hit a confusing data point. Note that I'm very much relying on cachegrinds counted instructions/memory accesses for my findings.
Check out the most recent version of nofib
, enter shootout/binary-trees
and run the following script:
#! /bin/sh
sed -i 's/import Debug.Trace//g' Main.hs # Make the following line idempotent
echo "import Debug.Trace" | cat - Main.hs > Main.tmp && mv Main.tmp Main.hs # add the import for trace
# bt1: Vanilly
sed -i 's/trace "" \$ bit/bit/g' Main.hs # strip `trace $ ` prefixes in the call to `bit`
ghc -O2 -XBangPatterns Main.hs -o bt1
# bt2: Additional trace call
sed -i 's/bit/trace "" $ bit/g' Main.hs # prepend `trace $ ` to the call to `bit`
ghc -O2 -XBangPatterns Main.hs -o bt2
valgrind --tool=cachegrind ./bt1 12 2>&1 > /dev/null # without trace
valgrind --tool=cachegrind ./bt2 12 2>&1 > /dev/null # with trace
This will compile two versions of binary-trees
, the original, unchanged version and one with an extra trace "" $
call before the only call to the bit
function. One would expect the version with the trace
call (bt2
) to allocate more than the version without (bt1
). Indeed, the output of +RTS -s
suggests that:
$ ./bt1 12 +RTS -s
...
43,107,560 bytes allocated in the heap
...
$ ./bt2 12 +RTS -s
...
43,116,888 bytes allocated in the heap
...
That's fine. A few benchmark runs by hand also suggested the tracing version is a little slower (probably due to IO).
Compare that to the output of the above cachegrind calls:
$ valgrind --tool=cachegrind ./bt1 12 > /dev/null
...
I refs: 118,697,583
...
D refs: 43,475,212
...
$ valgrind --tool=cachegrind ./bt2 12 > /dev/null
...
I refs: 116,340,710
...
D refs: 42,523,369
...
It's the other way round here! How's that possible?
Even if this isn't strictly a bug in GHC or NoFib, it's relevant nonetheless, as our benchmark infrastructure currently relies on instruction counts. I couldn't reproduce this by writing my own no-op trace _ a = a; {-# NOINLINE trace #-}
, btw.
I checked this on GHC 8.2.2, 8.4.3 and a semi-recent HEAD commit (bb539cfe).