Baseline of performance tests should be the last metric in/decrease rather than HEAD^ to counter drift

The recent drama with performance test flukes (!2192 (comment 245745) and d46a72e1) started when we decreased the threshold of a couple of performance tests (!1983 (closed)) from 10% (mostly) to 1% (mostly). While the goal is honorable, I find it problematic for two reasons

We get mentioned CI flukes, fueling a distrust in CI performance tests and at best leading to a lot of Metric Increase/Metric Decrease noise like d46a72e1, at worst to accepting performance regressions which were actually caused by a bug in the code.
It doesn't counter performance drift, it just lessens the effects. Put plainly: 40 consecutive commits with a 0.5% increase will lead to a silent >20% drift regardless.

While weeping on the shoulder of @matheus23 today after telling him about my experience, he came up with a great idea: To counter drift, performance tests should use the commit with the last metric increase for a test as a baseline, rather than HEAD^ (ultimately defaulting to the last GHC release). This is much more effective, even if we pick a threshold that isn't susceptible to CI flukes (1.), like 5%. Even with a 10% threshold, the 20% drift scenario can be ruled out.

CC'ing @mpickering, @AndreasK, @bgamari, @alp, @osa1

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information