Nursery size adulterates cachegrind metrics in nofib

I'm currently investigating an alleged regression in my branch of the late lambda lift and hit a confusing data point. Note that I'm relying heavily on cachegrind's counts of instructions and memory accesses for my findings.

Check out the most recent version of nofib, enter shootout/binary-trees and run the following script:

#! /bin/sh
sed -i 's/import Debug.Trace//g' Main.hs # Make the following line idempotent
echo "import Debug.Trace" | cat - Main.hs > Main.tmp && mv Main.tmp Main.hs # add the import for trace

# bt1: Vanilla
sed -i 's/trace "" \$ bit/bit/g' Main.hs # strip any `trace "" $ ` prefix from the call to `bit`
ghc -O2 -XBangPatterns Main.hs -o bt1

# bt2: Additional trace call
sed -i 's/bit/trace "" $ bit/g' Main.hs # prepend `trace "" $ ` to the call to `bit`
ghc -O2 -XBangPatterns Main.hs -o bt2

valgrind --tool=cachegrind ./bt1 12 2>&1 > /dev/null # without trace
valgrind --tool=cachegrind ./bt2 12 2>&1 > /dev/null # with trace

This compiles two versions of binary-trees: the original, unchanged version (bt1) and one with an extra trace "" $ before the only call to the bit function (bt2). One would expect bt2 to allocate more than bt1, and the output of +RTS -s confirms that:

$ ./bt1 12 +RTS -s
...
43,107,560 bytes allocated in the heap
...
$ ./bt2 12 +RTS -s
...
43,116,888 bytes allocated in the heap
...

That's fine. A few benchmark runs by hand also suggested the tracing version is a little slower (probably due to IO).

Compare that to the output of the above cachegrind calls:

$ valgrind --tool=cachegrind ./bt1 12 > /dev/null
...
I   refs:      118,697,583
...
D   refs:       43,475,212
...
$ valgrind --tool=cachegrind ./bt2 12 > /dev/null
...
I   refs:      116,340,710
...
D   refs:       42,523,369
...

It's the other way round here: bt2, which allocates more, is charged fewer instruction and data references than bt1. How's that possible?
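To put the figures above side by side, here is a minimal sketch that computes bt2 minus bt1 for each metric. The helper name `delta` is hypothetical and not part of nofib; the inputs are the numbers pasted above, so a positive result means bt2 is higher.

```shell
#! /bin/sh
# delta A B: print B - A, accepting the comma-grouped numbers that
# +RTS -s and cachegrind print. (Hypothetical helper, for illustration only.)
delta() {
  a=$(printf '%s' "$1" | tr -d ',')
  b=$(printf '%s' "$2" | tr -d ',')
  echo $((b - a))
}

delta 43,107,560 43,116,888    # bytes allocated: prints 9328
delta 118,697,583 116,340,710  # I refs:          prints -2356873
delta 43,475,212 42,523,369    # D refs:          prints -951843
```

So bt2 allocates about 9 KB more, yet cachegrind charges it roughly 2.36 million fewer instruction references and 0.95 million fewer data references, about 2% of each total.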

Even if this isn't strictly a bug in GHC or NoFib, it's relevant nonetheless, as our benchmark infrastructure currently relies on instruction counts. I couldn't reproduce this by writing my own no-op trace _ a = a; {-# NOINLINE trace #-}, btw.

I checked this on GHC 8.2.2, 8.4.3 and a semi-recent HEAD commit (bb539cfe).

Edited Mar 10, 2019 by Sebastian Graf
