nofib should use criterion-style bootstrapping/sampling
As I discovered when investigating situations like #9570, some nofib test cases produce meaningless numbers, and it is hard to tell unless you run nofib several times and notice that the percentage differences fluctuate up and down. The numbers would be more trustworthy, especially for users who are unaware of this noise, if we ran some statistical analysis to decide how many times to run each benchmark and to detect when there are many outliers (rather than blindly summarizing all the runs with an average).
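To make the request concrete, here is a minimal sketch of the kind of analysis meant, written in Python with only the standard library. It is not nofib or criterion code: the function names, the percentile bootstrap, the Tukey-fence outlier rule, and the 5% width threshold are all assumptions chosen for illustration.

```python
import random
import statistics


def bootstrap_mean_ci(samples, resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `samples`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    lo = means[int(alpha * resamples)]
    hi = means[min(resamples - 1, int((1.0 - alpha) * resamples))]
    return lo, hi


def tukey_outliers(samples, k=1.5):
    """Samples outside the Tukey fences (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = statistics.quantiles(samples, n=4)
    iqr = q3 - q1
    return [x for x in samples if x < q1 - k * iqr or x > q3 + k * iqr]


def needs_more_runs(samples, max_relative_width=0.05):
    """True when the CI is wider than the (assumed) fraction of the mean."""
    lo, hi = bootstrap_mean_ci(samples)
    return (hi - lo) / statistics.fmean(samples) > max_relative_width


if __name__ == "__main__":
    # Hypothetical run times in seconds; the last run is a stray slow one.
    runs = [1.00, 1.01, 0.99, 1.00, 1.02, 0.98, 1.00, 1.01, 0.99, 1.80]
    print("95% CI for the mean:", bootstrap_mean_ci(runs))
    print("outliers:", tukey_outliers(runs))
    print("run again?", needs_more_runs(runs))
```

A driver could keep re-running a benchmark while `needs_more_runs` returns true (up to some cap) and report the interval and outlier count alongside the mean, so that a noisy result is flagged rather than silently averaged.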
Trac metadata
Trac field | Value |
---|---|
Version | 7.9 |
Type | FeatureRequest |
TypeOfFailure | OtherFailure |
Priority | normal |
Resolution | Unresolved |
Component | NoFib benchmark suite |
Test case | |
Differential revisions | |
BlockedBy | |
Related | |
Blocking | |
CC | |
Operating system | |
Architecture | |