Proposed Change to GHC Performance Tests
This has now been implemented.
GHC has a test suite that can be run via make test (see Running the Testsuite). A subset of the tests are performance tests, which either test the performance of code generated by GHC or test the performance of GHC itself. This proposal is concerned with such performance tests and is motivated by various shortcomings in the current system. The fundamental changes proposed here are:
No longer specify expected performance results in
- Performance tests run against the previous commit's results instead of the expected results.
Log performance results (using git notes) for all test runs.
- The CI server will automatically record performance results for each merged commit, pushing the performance results to the git.haskell.com repo.
- A simple CLI tool is provided alongside the test driver to analyse previous results.
Recording and comparing performance results is always done with respect to a platform.
Tests are listed in .T files along with a manually chosen expected value and tolerance percentage. Expected values may be, but often aren't, platform specific. E.g. testing 'bytes allocated' with a tolerance of 10%:
test('T5631',
     [ compiler_stats_num_field('bytes allocated',
           [ (wordsize(32), 570137436, 10),
             (wordsize(64), 1106015512, 5) ]),
       only_ways(['normal']) ],
     compile,
     [''])
When running performance tests:
- The results are compared to the expected values with respective tolerances.
- Tests fail if they are outside the tolerance percentage from the expected value.
Issues with the current system
Performance creep. Successive commits might each reduce performance by (say) 0.5%, but only the 20th such change will trip the 10% threshold. That particular commit isn't really to blame (or is only 1/20 to blame), and this in turn leads authors to treat regressions as annoyances rather than as actual errors.
The programmer must specify expected values and tolerances. This seems like a reasonable expectation but is a source of frustration in practice. The CI server will run tests on various platforms which may all give varying results, making it hard for the programmer to pick an appropriate value. This leads to a time-consuming workflow of pushing code to the CI server and manually observing its output to derive expected values. This also leads to other issues:
- The programmer must update expected values. When performance significantly changes, such that performance tests fail, the programmer must pick new expected values and/or adjust the tolerance values. Picking these values again is more effort than we'd like.
- Expected values are usually NOT per platform. While it is currently possible to give expected values per platform, doing so takes extra effort for the programmer and is often not done. Instead the programmer often picks some middle value with a tolerance that works for all platforms.
- Tolerance intervals are too large. As the expected values are usually not platform specific, the tolerance values must be larger to make the tests pass in all cases (avoiding false positives). This leads to more false negatives (tests pass even though performance has changed unexpectedly).
Performance test results are not logged. Currently there is no quick way to observe past performance other than to check out, build, and run the performance tests for the commits in question.
Testing locally is not optimal. Again, the expected values are likely not targeted for the programmer's local platform leading to false positives and/or negatives when testing locally.
Tests are listed in .T files along with a manually chosen tolerance percentage (but no expected value, and never platform specific; that is handled automatically). E.g. testing 'bytes allocated' with a tolerance of 10%:
test('T5631',
     [ collect_compiler_stats('bytes allocated', 10),
       only_ways(['normal']) ],
     compile,
     [''])
When running performance tests:
The results are compared to the previous commit's results (stored in a git note) with respective tolerances (stored in the
Tests fail if they are outside the tolerance percentage of the previous commit's result, unless the programmer has indicated in the current commit message that a change is expected (see below).
- If the same test was run multiple times on the previous commit, then the average of those runs is used.
Tests trivially pass with a warning if results for the previous commit do not exist.
All test results (tagged with the current platform) are appended to the git note for the current commit.
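The comparison rules listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the test driver's actual code: the function name check_against_baseline, the status strings, and the 'increase'/'decrease' encoding of an expected change are all hypothetical, and the tolerance is assumed to be a percentage of the averaged baseline.

```python
from statistics import mean
from typing import Optional, Sequence


def check_against_baseline(
    current: float,
    previous_runs: Sequence[float],
    tolerance_pct: float,
    allowed_change: Optional[str] = None,
) -> str:
    """Return 'pass', 'warn' (no baseline), or 'fail' for one metric.

    allowed_change is 'increase', 'decrease', or None. It stands in for
    what the commit message declares as expected (hypothetical encoding).
    """
    if not previous_runs:
        # No results for the previous commit: trivially pass, with a warning.
        return "warn"

    # Several runs on the previous commit are averaged into one baseline.
    baseline = mean(previous_runs)
    lower = baseline * (1 - tolerance_pct / 100)
    upper = baseline * (1 + tolerance_pct / 100)

    if lower <= current <= upper:
        return "pass"
    # Outside the tolerance window: only acceptable if declared as expected.
    if current > upper and allowed_change == "increase":
        return "pass"
    if current < lower and allowed_change == "decrease":
        return "pass"
    return "fail"
```

For instance, with a baseline of 100 and a tolerance of 10, a result of 120 fails unless an increase has been declared as expected.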
The CI server will automatically record performance results for each merged commit, pushing the performance results to the git.haskell.com repo.
A simple CLI tool is provided alongside the test driver to analyse previous results. E.g. comparing results for test T123 across commits HEAD, $B, and $A can be done with the following command:
> python3 ./testsuite/driver/perf_notes.py --test-name T123 HEAD $B $A
With the proposed change, we hope to achieve the following benefits:
Per-platform performance results mean less variance in results. This allows for lower tolerance values and fewer false negatives without increasing false positives.
Logged performance results make analysing performance changes faster and easier.
Less work for the programmer.
- Expected values are no longer needed.
- Expected performance changes only require an indication that the change exists, rather than re-adjusting the expected values and/or tolerance.
More accurate local testing.
Test results are per platform
Performance results can vary between platforms. For example, a 64-bit GHC may use more memory than a 32-bit GHC. Different OSes may also affect performance differently. Hence results are always recorded and compared with respect to a given platform. This is a key advantage over the current system: performance results within a platform have less variance, which allows for smaller tolerance values.
In the implementation of this proposal the platform is a string set by the TEST_ENV environment variable. This is set by all relevant CircleCI jobs (to e.g. 'x86_64-linux' or 'x86_64-freebsd') but defaults to 'local' (appropriate for running tests on your local machine).
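The lookup described above amounts to reading an environment variable with a default. A minimal sketch, assuming only what the paragraph states (the helper name current_platform is hypothetical and not taken from the test driver):

```python
import os


def current_platform(environ=os.environ) -> str:
    """Return the platform string used to tag performance results."""
    # TEST_ENV is set by the CI jobs; fall back to 'local' everywhere else.
    return environ.get("TEST_ENV", "local")
```

Passing the environment as a parameter keeps the helper easy to exercise without touching the real process environment.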
Indicating expected performance changes
In some cases, performance tests will fail due to an expected change in performance. In this case, the programmer should double check that the change truly is expected. If it is, they must amend the git commit message to indicate this. For example, the following indicates that the 'bytes allocated' metric increases for tests T123 and T456 and decreases for T789:
Metric Increase 'bytes allocated': T123 T456
Metric Decrease 'bytes allocated': T789
The commit message will be parsed by the test driver and will allow the indicated changes. The exact text required to pass all failing tests will conveniently be output by the test driver if any performance tests fail.
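To make the parsing step concrete, here is a rough sketch of how such declarations could be extracted from a commit message. It is not the test driver's parser: the function name parse_allowed_changes is hypothetical, and it assumes each declaration has the form shown in the example above, with whitespace-separated test names that end at a blank line or at the next declaration.

```python
import re
from typing import Dict, Set, Tuple

# Matches a header such as: Metric Increase 'bytes allocated':
_HEADER = re.compile(r"Metric\s+(Increase|Decrease)\s+'([^']+)'\s*:")


def parse_allowed_changes(message: str) -> Dict[Tuple[str, str], Set[str]]:
    """Map (direction, metric) to the set of test names listed after it."""
    allowed: Dict[Tuple[str, str], Set[str]] = {}
    headers = list(_HEADER.finditer(message))
    for i, match in enumerate(headers):
        # Test names run from the end of this header to the next header,
        # or to the end of the message for the last header.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(message)
        segment = message[match.end():end]
        # Assumption: a blank line ends the list of test names.
        segment = segment.split("\n\n", 1)[0]
        key = (match.group(1), match.group(2))
        allowed.setdefault(key, set()).update(segment.split())
    return allowed
```

Applied to the example message, this yields T123 and T456 under ('Increase', 'bytes allocated') and T789 under ('Decrease', 'bytes allocated').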
Often a contributor will want to run performance tests on their local machine to characterise the effect of their patch on performance. For this they will want to first run the testsuite on their branch's base commit to establish a baseline, then again on the head of their branch. This will record the results in a git note for the 'local' platform (this git note is stored in your local git repo). The contributor can then compare the two sets of results via the CLI tool (where $BASE_COMMIT is their branch's base commit):
> python3 ./testsuite/driver/perf_notes.py HEAD $BASE_COMMIT
Git Notes Format
The results of performance tests are stored in git notes as tab-separated values with columns: platform, test name, way, metric, and value.
In order to isolate the git notes, the 'refs/notes/perf' ref is used (see the git notes --ref option). This allows users to make their own git notes without interfering with the performance result git notes. E.g. to see the git notes in the format above for commit $COMMIT, simply do:
> git notes --ref 'perf' show $COMMIT
x86_32-linux T123 normal max_bytes_used 1110101
x86_64-freebsd T123 normal max_bytes_used 2222323
This shows that test T123 was run on 2 different platforms with the normal way. On the 'x86_32-linux' platform, max_bytes_used was 1110101. On the 'x86_64-freebsd' platform, max_bytes_used was 2222323.
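Reading such a note back into structured records is straightforward. The sketch below is illustrative only: the PerfResult type, its field names, and parse_perf_note are hypothetical, and it assumes five columns in the order the example shows (platform, test, way, metric, value).

```python
from typing import List, NamedTuple


class PerfResult(NamedTuple):
    platform: str
    test: str
    way: str
    metric: str
    value: float


def parse_perf_note(note: str) -> List[PerfResult]:
    """Parse the text of a perf git note into one record per line."""
    results = []
    for line in note.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Split on tabs as the format specifies; fall back to any
        # whitespace so a space-separated rendering also parses.
        fields = line.split("\t") if "\t" in line else line.split()
        platform, test, way, metric, value = fields
        results.append(PerfResult(platform, test, way, metric, float(value)))
    return results
```

A line with the wrong number of fields raises ValueError in this sketch, which makes a malformed note visible instead of silently skipping it.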
Most of the work on this was already done by Jared Weakly for the 2017 Haskell Summer of Code (HSOC proposal, blog). A work in progress patch was submitted to Phabricator that summer. Unfortunately HSOC ended without a merge, and development on this proposal stalled. Since then David Eichmann has continued the work and created a new patch, an adapted (and rebased) version of the old patch.
- No longer specify expected performance results in
- Performance tests run against the previous commit's results.
- Log performance results in a git note for all test runs.
- A simple CLI tool is provided to analyse previous results.
- Recording and comparing performance results is always done with respect to a platform.
- CircleCI must push test results (git notes) to the git.haskell.com repo.
- Plot the results locally (!996 (closed))
- With performance results saved as git notes, we are open to creating better tool support for analysing that data (see drift issue).
- Though a large part of the benefit of this change would be to lower tolerance percentages for tests, that will only be done later, after results have been collected.
Comparing with the previous commit rather than with an expected value means that performance results may drift in one direction without tests ever failing. For example, with a tolerance value of 10% and an initial performance result of 100, the tests will pass even if the next few commits drift well past 10% of the original result: 110, 121, 121, 133, 146. Tests pass because the commit-to-commit change is never more than 10%. We need some means of detecting such drift.
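The arithmetic of this example can be checked with a few lines of Python. This is only a worked illustration of the numbers quoted above; the helper name commit_to_commit_results is hypothetical and is not part of the test driver or the CLI tool.

```python
def commit_to_commit_results(values, tolerance_pct):
    """Yield (value, change_pct, passes) for each commit after the first."""
    for prev, cur in zip(values, values[1:]):
        change_pct = (cur - prev) * 100 / prev
        yield cur, change_pct, abs(change_pct) <= tolerance_pct


# The sequence from the example: an initial result of 100, then five commits.
history = [100, 110, 121, 121, 133, 146]
checks = list(commit_to_commit_results(history, tolerance_pct=10))

# Every individual step stays within the 10% tolerance...
all_pass = all(passes for _, _, passes in checks)
# ...yet the cumulative change relative to the first result is far larger.
total_drift_pct = (history[-1] - history[0]) * 100 / history[0]
print(all_pass, total_drift_pct)
```

Every commit-to-commit check passes, while the last result is 46% above the first.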
It appears that automatically detecting drift is a hard problem, so we leave it up to the developers to check for drift manually. At the moment this is done via the CLI tool, though in the future it should be possible to create better tools that e.g. plot the data.
- Issue #12758 (closed).