Baseline of performance tests should be the last metric in/decrease rather than HEAD^ to counter drift
The recent drama with performance test flukes (!2192 (comment 245745) and d46a72e1) started when we decreased the threshold of a couple of performance tests (!1983 (closed)) from 10% (mostly) to 1% (mostly). While the goal is honorable, I find it problematic for two reasons
- We get mentioned CI flukes, fueling a distrust in CI performance tests and at best leading to a lot of
Metric Increase
/Metric Decrease
noise like d46a72e1, at worst to accepting performance regressions which were actually caused by a bug in the code. - It doesn't counter performance drift, it just lessens the effects. Put plainly: 40 consecutive commits with a 0.5% increase will lead to a silent >20% drift regardless.
While weeping on the shoulder of @matheus23 today after telling him about my experience, he came up with a great idea: To counter drift, performance tests should use the commit with the last metric increase for a test as a baseline, rather than HEAD^
(ultimately defaulting to the last GHC release). This is much more effective, even if we pick a threshold that isn't susceptible to CI flukes (1.), like 5%. Even with a 10% threshold, the 20% drift scenario can be ruled out.
CC'ing @mpickering, @AndreasK, @bgamari, @alp, @osa1