Commit 932cd41d authored by davide, committed by Ben Gamari

testsuite: Save performance metrics in git notes.

This patch makes the following improvements:
  - Automatically records test metrics (per test environment) so that
    the programmer need not supply nor update expected values in *.T
    files.
    - On expected metric changes, the programmer need only indicate the
      direction of change in the git commit message.
  - Provides a simple python tool "" to compare metrics
    over time.

Issues:
  - Using just the previous commit allows performance to drift with each
    commit.
    - Currently we allow drift as we have a preference for minimizing
      false positives.
    - Some possible alternatives include:
      - Use metrics from a fixed commit per test: the last commit that
        allowed a change in performance (else the oldest metric)
      - Or use some sort of aggregate since the last commit that allowed
        a change in performance (else all available metrics)
      - These alternatives may result in a performance issue (with the
        test driver) having to heavily search git commits/notes.
  - Run locally, performance tests will trivially pass unless the tests
    were run locally on the previous commit. This is often not the case
    e.g. after pulling recent changes.

Previously, *.T files contained statements such as:
stats_num_field('peak_megabytes_allocated', (2, 1))
compiler_stats_num_field('bytes allocated',
                         [(wordsize(64), 165890392, 10)])
This required the programmer to give the expected values and a tolerance
deviation (percentage). With this patch, the above statements are
replaced with:
collect_stats('peak_megabytes_allocated', 5)
collect_compiler_stats('bytes allocated', 10)
The programmer need only specify which metrics to test and a tolerance
deviation. No expected value is required. CircleCI will then run the
tests per test environment and record the metrics to a git note for that
commit and push them to the ghc repo. Metrics will be
compared to the previous commit. If they differ by more than the tolerance
deviation from the *.T file, then the corresponding test will fail. Expected
changes can be declared in the git commit message, e.g.
 # Metric (In|De)crease <metric(s)> <options>: <tests>
Metric Increase ['bytes allocated', 'peak_megabytes_allocated'] \
         (test_env='linux_x86', way='default'):
    Test012, Test345
Metric Decrease 'bytes allocated':
Metric Increase:
    Test711
This will allow the noted changes (letting the test pass). Note that by
omitting metrics or options, the change will apply to all possible
metrics/options (i.e. in the above, an increase for all metrics in all
test environments is allowed for Test711).

phabricator will use the message in the description

Reviewers: bgamari, hvr

Reviewed By: bgamari

Subscribers: rwbarton, carter

GHC Trac Issues: #12758

Differential Revision:
parent 82a5c241
......@@ -18,7 +18,7 @@ aliases:
# ideally we would simply set THREADS here instead of re-detecting it every
# time we need it below. Unfortunately, there is no way to set an environment
# variable with the result of a shell script.
- &boot
......@@ -32,6 +32,12 @@ aliases:
include mk/flavours/\$(BuildFlavour).mk
- &set_git_identity
name: Set Git Identity
command: |
git config ""
git config "GHC CircleCI"
- &configure_unix
name: Configure
......@@ -64,10 +70,16 @@ aliases:
name: Test
command: |
mkdir -p test-results
make test THREADS=`mk/` SKIP_PERF_TESTS=YES JUNIT_FILE=../../test-results/junit.xml
- &store_test_results
path: test-results
- &push_perf_note
name: Push Performance Git Notes
command: .circleci/
- &slowtest
name: Full Test
......@@ -102,8 +114,10 @@ jobs:
<<: *buildenv
TEST_ENV: x86_64-linux
- checkout
- *set_git_identity
- *prepare
- *submodules
- *boot
......@@ -113,6 +127,7 @@ jobs:
- *storeartifacts
- *test
- *store_test_results
- *push_perf_note
resource_class: xlarge
......@@ -122,8 +137,10 @@ jobs:
<<: *buildenv
GHC_COLLECTOR_FLAVOR: x86_64-freebsd
TEST_ENV: x86_64-freebsd
- checkout
- *set_git_identity
- *prepare
- *submodules
- *boot
......@@ -133,6 +150,7 @@ jobs:
- *storeartifacts
- *test
- *store_test_results
- *push_perf_note
......@@ -147,8 +165,10 @@ jobs:
# Build with in-tree GMP since this isn't available on OS X by default.
CONFIGURE_OPTS: --with-intree-gmp
<<: *buildenv
TEST_ENV: x86_64-darwin
- checkout
- *set_git_identity
- *prepare
- *submodules
- *boot
......@@ -158,6 +178,7 @@ jobs:
- *storeartifacts
- *test
- *store_test_results
- *push_perf_note
resource_class: xlarge
......@@ -167,6 +188,7 @@ jobs:
<<: *buildenv
- checkout
- *set_git_identity
- *prepare
- *submodules
- *boot
......@@ -179,8 +201,10 @@ jobs:
- image: ghcci/x86_64-linux:0.0.4
<<: *buildenv
TEST_ENV: x86_64-linux-unreg
- checkout
- *set_git_identity
- *prepare
- *submodules
- *boot
......@@ -188,6 +212,7 @@ jobs:
- *make
- *test
- *store_test_results
- *push_perf_note
resource_class: xlarge
......@@ -196,6 +221,7 @@ jobs:
<<: *buildenv
BUILD_FLAVOUR: perf-llvm
TEST_ENV: x86_64-linux-llvm
- run:
name: Install LLVM
......@@ -206,12 +232,14 @@ jobs:
name: Verify that llc works
command: llc
- checkout
- *set_git_identity
- *prepare
- *submodules
- *boot
- *configure_unix
- *make
- *test
- *push_perf_note
# Nightly build with -DDEBUG using devel2 flavour
......@@ -221,8 +249,11 @@ jobs:
<<: *buildenv
TEST_ENV: x86_64-linux-debug
- checkout
- *set_git_identity
- *prepare
- *submodules
- *boot
......@@ -230,6 +261,7 @@ jobs:
- *make
- *test
- *store_test_results
- *push_perf_note
resource_class: xlarge
......@@ -238,8 +270,10 @@ jobs:
<<: *buildenv
TEST_ENV: i386-linux
- checkout
- *set_git_identity
- *prepare
- *submodules
- *boot
......@@ -249,6 +283,7 @@ jobs:
- *storeartifacts
- *test
- *store_test_results
- *push_perf_note
resource_class: xlarge
......@@ -257,8 +292,10 @@ jobs:
<<: *buildenv
TEST_ENV: x86_64-fedora
- checkout
- *set_git_identity
- *prepare
- *submodules
- *boot
......@@ -268,6 +305,7 @@ jobs:
- *storeartifacts
- *test
- *store_test_results
- *push_perf_note
resource_class: xlarge
......@@ -285,6 +323,7 @@ jobs:
- *make
- *slowtest
- *store_test_results
- *push_perf_note
version: 2
#!/usr/bin/env bash
# vim: sw=2 et
set -euo pipefail
fail() {
  echo "ERROR: $*" >&2
  exit 1
}
# Add as a known host.
echo "|1|F3mPVCE55+KfApNIMYQ3Dv39sGE=|1bRkvJEJhAN2R0LE/lAjFCEJGl0= ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBBUZS9jGBkE5UzpSo6irnIgcQcfzvbuIOsFc8+N61FwtZncRntbaKPuUimOFPgeaUZLl6Iajz6IIs7aduU0/v+I=" >> ~/.ssh/known_hosts
echo "|1|2VUMjYSRVpT2qJPA0rA9ap9xILY=|5OThkI4ED9V0J+Es7D5FOD55Klk= ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC+3TLluLAO4lkW60W+N2DFkS+WoRFGqLwHzgd1ifxG9TIm31wChPY3E/hgMnJmgGqWCF4UDUemmyCycEaL7FtKfzjTAclg9EfpQnozyE3T5hIo2WL7SN5O8ttG/bYGuDnn14jLnWwJyN4oz/znWFiDG9e2Oc9YFNlQ+PK8ae5xR4gqBB7EOoj9J1EiPqG2OXRr5Mei3TLsRDU6fnz/e4oFJpKWWeN6M63oePv0qoaGjxcrATZUWsuWrxVMmYo9kP1xRuFJbAUw2m4uVP+793SW1zxySi1HBMtJG+gCDdZZSwYbkV1hassLWBHv1qPttncfX8Zek3Z3VolaTmfWJTo9" >> ~/.ssh/known_hosts
# Check that git notes don't already exist.
# This is a precaution as we reset refs/notes/perf and we want to avoid data loss.
if [ "$(git notes --ref=perf list | wc -l)" -ne 0 ]
then
  fail "Found an existing git note on HEAD. Expected no git note."
fi
# Assert that the METRICS_FILE exists and can be read.
if [ "$METRICS_FILE" = "" ] || ! [ -r "$METRICS_FILE" ]
then
  fail "Metrics file not found: $METRICS_FILE"
fi
# Reset the git notes and append the metrics file to the notes, then push and return the result.
# This is favoured over a git notes merge as it avoids potential data loss/duplication from the merge strategy.
function reset_append_note_push {
  git fetch -f "$GHC_ORIGIN" refs/notes/perf:refs/notes/perf || true
  echo "git notes --ref=perf append -F $METRICS_FILE HEAD"
  git notes --ref=perf append -F "$METRICS_FILE" HEAD
  git push "$GHC_ORIGIN" refs/notes/perf
}
# Push the metrics file as a git note. This may fail if another task pushes a note first. In that case
# the latest note is fetched and appended.
MAX_RETRY=20  # Assumed retry budget; the value is not shown in this excerpt.
until reset_append_note_push || [ "$MAX_RETRY" -le 0 ]
do
  MAX_RETRY=$((MAX_RETRY - 1))
  echo ""
  echo "Failed to push git notes. Fetching, appending, and retrying..."
done
......@@ -176,12 +176,7 @@ test('topHandler04',
[ stats_num_field('bytes allocated',
[ (wordsize(64), 16828144, 5)
# with GHC-7.6.3: 83937384 (but faster execution than the next line)
# before: 58771216 (without call-arity-analysis)
# expected value: 16828144 (2014-01-14)
, (wordsize(32), 8433644, 5) ])
[ collect_stats('bytes allocated',5)
, only_ways(['normal'])],
......@@ -208,9 +203,7 @@ test('T8089',
test('T8684', expect_broken(8684), compile_and_run, [''])
test('T9826',normal, compile_and_run,[''])
[ stats_num_field('bytes allocated',
[ (wordsize(64), 51840, 20)
, (wordsize(32), 47348, 20) ])
[ collect_stats('bytes allocated')
, only_ways(['normal'])],
......@@ -223,10 +216,7 @@ test('lazySTexamples', normal, compile_and_run, [''])
test('T11760', req_smp, compile_and_run, ['-threaded -with-rtsopts=-N2'])
test('T12874', normal, compile_and_run, [''])
[ stats_num_field('bytes allocated',
[ (wordsize(64), 185943272, 5) ])
# with GHC-8.1 before liftA2 change: 325065128
# GHC-8.1 with custom liftA2: 185943272
[ collect_stats('bytes allocated', 5)
, only_ways(['normal'])],
......@@ -234,7 +224,7 @@ test('T13525', when(opsys('mingw32'), skip), compile_and_run, [''])
test('T13097', normal, compile_and_run, [''])
test('functorOperators', normal, compile_and_run, [''])
[stats_num_field('max_bytes_used', [ (wordsize(64), 44504, 5) ]),
compile_and_run, ['-O'])
test('T14425', normal, compile_and_run, [''])
# GHC Driver Readme
Greetings and well met. If you are reading this, I can only assume that you
are likely interested in working on the testsuite in some capacity. For more
detailed documentation, please see [here][1].
## ToC
1. Entry points of the testsuite performance tests
2. Quick overview of program parts
3. How to use the comparison tool
4. Important Types
5. Quick answers for "how do I do X"?
## Entry Points of the testsuite performance tests
The testsuite has two main entry points, depending on the perspective from
which you approach it. From the perspective of the test writer, the entry point is the
collect_stats function called in *.T files. This function is declared in along with its associated infrastructure. The purpose of this
function is to tell the test driver what metrics to compare when processing
the test. From the perspective of running the test-suite e.g. via make, its
entry point is the file. That file contains the main logic for
running the individual tests, collecting information, handling failure, and
outputting the final results.
## Overview of how the performance test bits work.
During a Haskell Summer of Code project, an intern revamped most of the
performance test code, so there have been a few changes that might be unusual
to anyone previously familiar with the testsuite. One of the biggest immediate
benefits is that the test writer no longer needs to consider platform
differences, compiler differences, and the like. This is because the test
comparison relies entirely on metrics collected locally on the testing machine.
As such, it is perfectly sufficient to write `collect_stats('all',20)` in the
".T" files to measure the 3 potential stats that can be collected for that test
and automatically test them for regressions, failing if there is more than a 20%
change in any direction. In fact, even that is not necessary as
`collect_stats()` defaults to 'all', and 20% deviation allowed.
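To make the deviation rule concrete, here is a minimal Python sketch of such a check. The helper name `within_deviation` is hypothetical and is not part of the test driver; the sketch only assumes that the allowed window is the baseline value plus or minus the given percentage.

```python
def within_deviation(baseline, actual, deviation_percent=20):
    """Return True if `actual` is within +/- deviation_percent of `baseline`.

    Hypothetical helper for illustration only; the real driver may round
    or compare differently.
    """
    allowed = abs(baseline) * deviation_percent / 100
    return abs(actual - baseline) <= allowed


# A 10% increase is tolerated at the default 20% deviation ...
print(within_deviation(1000, 1100))  # True
# ... but a 25% decrease is not.
print(within_deviation(1000, 750))   # False
```

Note that the check is symmetric: an improvement beyond the window is flagged just like a regression.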
The function `collect_compiler_stats()` is completely equivalent in every way to
`collect_stats` except that it measures the performance of the compiler itself
rather than the performance of the code generated by the compiler. See the
implementation of collect_stats in /driver/ for more information.
If the performance of a test is improved so much that the test fails, the value
will still be recorded. The warning that will be emitted is merely a precaution
so that the programmer can double-check that they didn't introduce a bug;
something that might be suspicious if the test suddenly improves by 70%,
for example.
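The following sketch shows one way a measurement could be classified against a baseline, so that a large improvement is reported as a change just like a regression. The function `classify_change` is hypothetical; only the four category names mirror the `MetricChange` values used by the driver.

```python
def classify_change(baseline, actual, deviation_percent):
    """Classify `actual` relative to `baseline` (illustrative sketch).

    Returns 'NewMetric' when there is no baseline, otherwise 'NoChange',
    'Increase' or 'Decrease' depending on the allowed deviation window.
    """
    if baseline is None:
        return 'NewMetric'
    allowed = abs(baseline) * deviation_percent / 100
    if actual > baseline + allowed:
        return 'Increase'
    if actual < baseline - allowed:
        return 'Decrease'
    return 'NoChange'


print(classify_change(1000, 300, 20))   # Decrease (a 70% improvement)
print(classify_change(1000, 1050, 20))  # NoChange
print(classify_change(None, 1050, 20))  # NewMetric
```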
Performance metrics for performance tests are now stored in git notes under the
namespace 'perf'. The format of the git note file is that each line represents
a single metric for a particular test: `$test_env $test_name $test_way
$metric_measured $value_collected` (delimited by tabs).
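As a rough illustration of this format, the sketch below parses the text of a note into named tuples. The field names follow the order given above; the function name `parse_note`, the conversion of values to `float`, and the sample data are assumptions made for the example.

```python
from collections import namedtuple

PerfStat = namedtuple('PerfStat', ['test_env', 'test', 'way', 'metric', 'value'])


def parse_note(note_text):
    """Parse tab-delimited note text into a list of PerfStat tuples."""
    stats = []
    for line in note_text.split('\n'):
        if not line.strip():
            continue  # Skip blank lines.
        test_env, test, way, metric, value = line.strip('\t').split('\t')
        # Converting to float is an assumption made for this sketch.
        stats.append(PerfStat(test_env, test, way, metric, float(value)))
    return stats


# Illustrative values only, not real measurements.
example_note = (
    "x86_64-linux\tT9826\tnormal\tbytes allocated\t51840\n"
    "x86_64-linux\tT12874\tnormal\tbytes allocated\t185943272\n"
)
for stat in parse_note(example_note):
    print(stat.test, stat.metric, stat.value)
```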
One can view the maximum deviation a test allows by looking inside its
respective all.T file; additionally, if one sets the verbosity level of the
test-suite to a value >= 4, they will see a good amount of output per test
detailing all the information about values. This information will also print
if the test falls outside of the allowed bounds. (see the test_cmp function in
/driver/ for exact formatting of the message)
The git notes are only appended to by the testsuite in a single atomic python
subprocess at the end of the test run; if the run is canceled at any time, the
notes will not be written. The note appending command will be retried up to 4
times in the event of a failure (such as one happening due to a lock on the
repo) although this is never anticipated to happen. If, for some reason, the 5
attempts were not enough, an error message will be printed out. Further, there
is no current process or method for stripping duplicates, updating values, etc.,
so if the testsuite is run multiple times per commit there will be multiple
values in the git notes corresponding to the tests run. In this case the
average value is used.
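The averaging of duplicate entries can be sketched as follows. The function `average_duplicates` is a hypothetical helper, and the sketch assumes that two entries are duplicates when they share the same test environment, test name, way, and metric.

```python
from collections import defaultdict


def average_duplicates(stats):
    """Average values sharing the same (test_env, test, way, metric) key.

    `stats` is a list of (test_env, test, way, metric, value) tuples.
    """
    grouped = defaultdict(list)
    for test_env, test, way, metric, value in stats:
        grouped[(test_env, test, way, metric)].append(float(value))
    return {key: sum(vals) / len(vals) for key, vals in grouped.items()}


# Two runs of the same test on the same commit (illustrative values).
runs = [
    ('local', 'T9826', 'normal', 'bytes allocated', 50000),
    ('local', 'T9826', 'normal', 'bytes allocated', 52000),
]
print(average_duplicates(runs))
# {('local', 'T9826', 'normal', 'bytes allocated'): 51000.0}
```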
## Quick overview of program parts
The relevant bits of the directory tree are as such:
├── driver -- Testsuite driver directory
├── -- Contains code implementing JUnit features.
├── -- Some of the uglier implementation details.
├── -- Comparison tool and performance tests.
├── -- Main entrypoint for program; runs tests.
├── -- Global data structures and objects.
├── -- Bulk of implementation is in here.
└── -- Misc helper functions.
├── mk
└── -- Master makefile for running tests.
├── tests -- Main tests directory.
## How to Use the Comparison Tool
The comparison tool exists in `/driver/`.
When the testsuite is run, the performance metrics of the performance tests are
saved automatically in a local git note that will be attached to the commit.
The comparison tool is designed to help analyze performance metrics across
commits using this performance information.
Currently, it can only be run by executing the file directly, like so:
$ python3 (arguments go here)
If you run ` -h` you will see a description of all of the
arguments and how to use them. The optional arguments exist to filter the
output to include only commits that you're interested in. The most typical
usage of this tool will likely be running ` HEAD 'HEAD~1' '(commit
hash)' ...`
Performance metrics stored in git notes remain strictly local to the machine;
as such, performance metrics will not exist for a commit until you check out
that commit and run the testsuite (or test).
## Quick Answers for "How do I do X?"
* Q: How do I add a flag to "make test" to extend the testsuite functionality?
1. Add the flag in the appropriate global object in
2. Add an argument to the parser in that sets the flag
3. Go to the `testsuite/mk/` file and add a new ifeq (or ifneq)
block. I suggest adding the block around line 200.
* Q: How do I modify how performance tests work?
* That functionality resides in which has pretty good
in-code documentation.
* Additionally, one will want to look at `compile_and_run`, `simple_run`,
and `simple_build` in
#!/usr/bin/env python3
# (c) Jared Weakly 2017
# This file will be a utility to help facilitate the comparison of performance
# metrics across arbitrary commits. The file will produce a table comparing
# metrics between measurements taken for given commits in the environment
# (which defaults to 'local' if not given by --test-env).
import argparse
import re
import subprocess
import time
from collections import namedtuple
from math import ceil, trunc
from testutil import passed, failBecause
# Some data access functions. At the moment this uses git notes.
# The metrics (a.k.a stats) are named tuples, PerfStat, in this form:
# ( test_env : 'val', # Test environment.
# test : 'val', # Name of the test
# way : 'val',
# metric : 'val', # Metric being recorded
# value : 'val', # The statistic result e.g. runtime
# )
# All the fields of a metric (excluding commit field).
PerfStat = namedtuple('PerfStat', ['test_env','test','way','metric','value'])
class MetricChange:
    NewMetric = 'NewMetric'
    NoChange = 'NoChange'
    Increase = 'Increase'
    Decrease = 'Decrease'

def parse_perf_stat(stat_str):
    field_vals = stat_str.strip('\t').split('\t')
    return PerfStat(*field_vals)
# Get all recorded (in a git note) metrics for a given commit.
# Returns an empty array if the note is not found.
def get_perf_stats(commit='HEAD', namespace='perf'):
    try:
        log = subprocess.check_output(['git', 'notes', '--ref=' + namespace, 'show', commit], stderr=subprocess.STDOUT).decode('utf-8')
    except subprocess.CalledProcessError:
        return []

    log = log.strip('\n').split('\n')
    log = list(filter(None, log))
    log = [parse_perf_stat(stat_str) for stat_str in log]
    return log
# Get allowed changes to performance. This is extracted from the commit message of
# the given commit in this form:
# Metric (Increase | Decrease) ['metric' | \['metrics',..\]] [\((test_env|way)='abc',...\)]: TestName01, TestName02, ...
# Returns a *dictionary* from test name to a *list* of items of the form:
#   {
#       'direction': either 'Increase' or 'Decrease',
#       'metrics': ['metricA', 'metricB', ...],
#       'opts': {
#           'optionA': 'string value',
#           'optionB': 'string value',
#           ...
#       }
#   }
def get_allowed_perf_changes(commit='HEAD'):
    commitByteStr = subprocess.check_output(['git', '--no-pager', 'log', '-n1', '--format=%B', commit])
    return parse_allowed_perf_changes(commitByteStr.decode())

def parse_allowed_perf_changes(commitMsg):
    # Helper regex. Non-capturing unless postfixed with Cap.
    s = r"(?:\s*\n?\s+)"                                    # Space, possible new line with an indent.
    qstr = r"(?:'(?:[^'\\]|\\.)*')"                         # Quoted string.
    qstrCap = r"(?:'((?:[^'\\]|\\.)*)')"                    # Quoted string. Captures the string without the quotes.
    innerQstrList = r"(?:"+qstr+r"(?:"+s+r"?,"+s+r"?"+qstr+r")*)?"  # Inside of a list of strings.
    qstrList = r"(?:\["+s+r"?"+innerQstrList+s+r"?\])"      # A list of strings (using box brackets).

    exp = (r"^Metric"
        +s+r"(Increase|Decrease)"                           # Direction of change.
        +s+r"?("+qstr+r"|"+qstrList+r")?"                   # Metric or list of metrics.
        +s+r"?(\(" + r"(?:[^')]|"+qstr+r")*" + r"\))?"      # Options surrounded by parentheses. (Allows parentheses in quoted strings.)
        +s+r"?:?"                                           # Optional ":"
        +s+r"?((?:(?!\n\n)(?!\n[^\s])(?:.|\n))*)"           # Test names. Stop parsing on empty or non-indented new line.
        )

    matches = re.findall(exp, commitMsg, re.M)
    changes = {}
    for (direction, metrics_str, opts_str, tests_str) in matches:
        tests = re.findall(r"(\w+)", tests_str)
        for test in tests:
            changes.setdefault(test, []).append({
                'direction': direction,
                'metrics': re.findall(qstrCap, metrics_str),
                'opts': dict(re.findall(r"(\w+)"+s+r"?="+s+r"?"+qstrCap, opts_str))
            })

    return changes
# Calculates a suggested string to append to the git commit in order to accept the
# given changes.
# changes: [(MetricChange, PerfStat)]
def allow_changes_string(changes):
    Dec = MetricChange.Decrease
    Inc = MetricChange.Increase

    # We only care about increase / decrease metrics.
    changes = [change for change in changes if change[0] in [Inc, Dec]]

    # Map tests to a map from change direction to metrics.
    test_to_dir_to_metrics = {}
    for (change, perf_stat) in changes:
        change_dir_to_metrics = test_to_dir_to_metrics.setdefault(perf_stat.test, { Inc: [], Dec: [] })
        change_dir_to_metrics[change].append(perf_stat.metric)

    # Split into 3 groups.
    # Tests where all changes are *increasing*.
    # Tests where all changes are *decreasing*.
    # Tests where changes are *mixed* increasing and decreasing.
    groupDec = []
    groupInc = []
    groupMix = []
    for (test, decsAndIncs) in test_to_dir_to_metrics.items():
        decs = decsAndIncs[Dec]
        incs = decsAndIncs[Inc]
        if decs and incs:
            groupMix.append(test)
        elif not decs:
            groupInc.append(test)
        else:
            groupDec.append(test)

    msgs = []
    nltab = '\n    '

    # Decreasing group.
    if groupDec:
        msgs.append('Metric Decrease:' + nltab + nltab.join(groupDec))

    # Increasing group.
    if groupInc:
        msgs.append('Metric Increase:' + nltab + nltab.join(groupInc))

    # Mixed group.
    if groupMix:
        # Split mixed group tests by decrease/increase, then by metric.
        dir_to_metric_to_tests = {
            Dec: {},
            Inc: {}
        }
        for test in groupMix:
            for change_dir, metrics in test_to_dir_to_metrics[test].items():
                for metric in metrics:
                    dir_to_metric_to_tests[change_dir].setdefault(metric, []).append(test)

        for change_dir in [Dec, Inc]:
            metric_to_tests = dir_to_metric_to_tests[change_dir]
            for metric in sorted(metric_to_tests.keys()):
                tests = metric_to_tests[metric]
                msgs.append('Metric ' + change_dir + ' \'' + metric + '\':' + nltab + nltab.join(tests))

    return '\n\n'.join(msgs)
# Formats a list of metrics into a string. Used e.g. to save metrics to a file or git note.
def format_perf_stat(stats):
    # If a single stat, convert to a singleton list.
    if not isinstance(stats, list):
        stats = [stats]

    return "\n".join(["\t".join([str(stat_val) for stat_val in stat]) for stat in stats])
# Appends a list of metrics to the git note of the given commit.
# Tries up to max_tries times to write to git notes should it fail for some reason.
# Each retry will wait 1 second.
# Returns True if the note was successfully appended.
def append_perf_stat(stats, commit='HEAD', namespace='perf', max_tries=5):
    # Append to git note
    print('Appending ' + str(len(stats)) + ' stats to git notes.')
    stats_str = format_perf_stat(stats)
    def try_append():
        try:
            return subprocess.check_output(['git', 'notes', '--ref=' + namespace, 'append', commit, '-m', stats_str])
        except subprocess.CalledProcessError:
            return b'Git - fatal'

    tries = 0