---
title: "Improvements in GHC's testsuite infrastructure"
author: Ben Gamari
date: "2019-07-08"
---
GHC's testsuite is our first line of defense against correctness regressions.
However, as is often the case, the infrastructure that keeps it running has
long been neglected. Our recent efforts to enforce CI cleanliness in all
GHC builds have resulted in a few bits of work that I thought would be nice to
share.
# Improving testsuite driver maintainability
GHC's testsuite is a collection of Makefiles and Haskell programs all glued
together with a [clump of
Python](https://gitlab.haskell.org/ghc/ghc/tree/master/testsuite/driver) known
as the testsuite driver. The testsuite driver is a clever albeit quirky piece
of software which, despite its implementation language, displays the marks of a
codebase written by a functional programmer. It defines a small embedded DSL
for describing tests, the user-facing interface of which is
generally declarative. For instance, a simple test definition might look like:
```python
test('T13618', normal, compile_and_run, ['-v0'])
```
To provide this succinct language, the implementation relies on a number of
clever tricks, often involving global variables, functions returning functions,
and a good helping of mutation.
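To give a flavor of this style, here is a minimal sketch (simplified, not the driver's actual code) of how a modifier combinator like `when` might be written as a function returning a function:
```python
def normal(name, opts):
    # the identity modifier: leave the test's options unchanged
    pass

def when(b, f):
    # apply the modifier f only when the condition b holds
    return f if b else normal
```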
Of course, all of these clever tricks are implemented in a dynamically-typed
language in a codebase that has evolved over the course of nearly 20 years (the
first
[commit](https://gitlab.haskell.org/ghc/ghc/commit/e5063a042c9a1701ea7273da7bacb530d5c077d3)
to `testlib.py` was by Simon Marlow in 2003). In its nearly 5kLoC
implementation one will find integers used as booleans, magic
strings, undeclared global variables, variables that are sometimes thread-local
yet elsewhere considered global, heaps of string concatenation, and numerous
other curiosities. Needless to say, comprehending, let alone
modifying, the testsuite driver has been getting harder and harder with every
accrued feature.
In recent years the Python community has gradually awoken to the
promise of statically-checked types. Python 3's type annotation syntax in
conjunction with the [mypy](http://mypy-lang.org/) typechecker has given us a
path to bringing order to this cleverness. `mypy` implements a pleasantly complete
type system in which almost any Haskell 98 user would feel at home, with support
for newtypes, sums, and parametric polymorphism.
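For instance, here is a small illustrative sketch (not taken from the driver itself) of the kinds of types mypy can check:
```python
from typing import List, NewType, Optional, TypeVar, Union

# a newtype: WayName is distinct from a plain str as far as mypy is concerned
WayName = NewType('WayName', str)

# a sum: a value drawn from a closed set of alternatives
Outcome = Union[int, str]

# parametric polymorphism: works uniformly for any element type
T = TypeVar('T')
def first(xs: List[T]) -> Optional[T]:
    return xs[0] if xs else None
```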
Over the last few weeks the driver has grown hundreds of lines of type
annotations, all checked by `mypy`. Not only has this made the codebase
significantly more readable but the process unearthed a few latent bugs as well.
Of course, to ensure that the testsuite driver doesn't regress, it is now
[typechecked](https://gitlab.haskell.org/ghc/ghc/commit/95f56853d6288076551a5326ad7b4778742a51ce)
during the `lint` stage of GHC's CI pipeline.
In addition to typechecking, we took this opportunity to perform some
long-overdue refactoring to use modern Python interfaces (e.g. `pathlib`
instead of strings, `bool` instead of `int`, use of `None` where appropriate)
and added numerous assertions (revealing yet more unnoticed bugs, some of which
were silently causing tests to be inappropriately skipped or run).
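As an illustrative example of the style of change involved (the names here are hypothetical, not the driver's actual code), the `pathlib` migration amounts to trading string concatenation for typed path manipulation:
```python
from pathlib import Path

# Before: paths built by string concatenation
def stdout_file_old(testdir: str, name: str) -> str:
    return testdir + '/' + name + '.stdout'

# After: pathlib, which is typed and portable across platforms
def stdout_file(testdir: Path, name: str) -> Path:
    return testdir / (name + '.stdout')
```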
# Fragile and broken tests
GHC's testsuite has a sizeable configuration space, with over 30 "ways" in
which tests may be run (e.g. normal, profiled, with the threaded RTS, etc.) and
a few "speeds" which select a subset of tests to be run.
For most of its existence GHC's CI infrastructure (in its various forms) has
tested only a small subset of the testsuite tests (namely the `normal` speed
with around a dozen of the ways enabled). While we have periodically looked
at the full `slow` testsuite output, rarely were we able to make significant
headway in fixing the many issues we found.
Recently our new CI infrastructure has placed a renewed emphasis on
improving testsuite coverage by regularly (e.g. at least on a nightly basis)
testing the entire testsuite (as well as `nofib`, our performance testsuite).
This, of course, meant fixing the hundreds of failing tests in the full
testsuite run.
These failures generally broke down into a few classes:
1. tests that are themselves buggy
2. tests on which GHC has regressed (but we hadn't noticed due to the test not
being run in the default testsuite configuration)
3. tests which are broken in certain ways
4. tests which fail non-deterministically in some or all ways
To handle cases (1-3) GHC has long had an `expect_broken` test modifier to
mark a test as known to be broken due to a particular ticket, e.g.:
```python
test('T13366',
     [when(opsys('darwin'), expect_broken(16083))],
     compile_and_run,
     ['-lstdc++ -v0'])
```
This modifier causes the test to be run, failing if it somehow successfully
finishes (to ensure that we notice if the test is inadvertently fixed).
Moreover, it encodes the fact that #16083 is the ticket where the breakage is
documented.
# Tracking fragile test outcomes
Sadly, the `expect_broken` mechanism is not appropriate for fragile tests,
which may pass or fail nondeterministically.
For this case we have introduced a new `fragile` modifier which runs the test but
merely takes note of whether it passed. We can then report this information in
two places:
* In the testsuite report printed at the end of the run, ensuring that it
shows up in the testsuite log
* In the JUnit report produced by the testsuite run.
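As a sketch of the usage, marking a test as fragile looks much like `expect_broken`, assuming the modifier likewise cites the ticket tracking the flakiness (the test name and ticket number below are placeholders):
```python
test('T12345',           # hypothetical test name
     [fragile(12345)],   # 12345 is a placeholder ticket number
     compile_and_run,
     ['-v0'])
```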
While `fragile` is a step up from simply skipping broken tests, we clearly want
to ensure that the designation is eventually removed from tests that have been
stabilized (intentionally or not).
For a few weeks we tried checking for such tests by hand: manually examining
recent testsuite logs and looking for patterns of fragile tests which routinely
pass. However, not only was this time-consuming but it was also quite
error-prone as some of our fragile tests pass over 90% of the time.
For this reason we have developed [a tiny bit of
automation](https://gitlab.haskell.org/ghc/ghc-bot/blob/master/test_tracking/ingest_junit.py)
to help with this process: a GitLab webhook ingests the JUnit output from each
testsuite run into a relational database for later analysis. This turns the
previously-arduous task of identifying no-longer-fragile tests into a simple
SQL query:
```sql
-- Fragile tests with no recorded "fragile fail" in the last four days
SELECT *
FROM results_view AS x
WHERE message = 'fragile'
  AND NOT EXISTS (SELECT *
                  FROM results_view
                  WHERE test_name = x.test_name
                    AND other->>'reason' = 'fragile fail'
                    AND date > now() - interval '4 days');
```
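For context, here is a minimal sketch of what the ingestion side might look like; the real webhook is linked above, and the schema, table, and column names here are purely illustrative (with `sqlite3` standing in for the actual relational database):
```python
import sqlite3
import xml.etree.ElementTree as ET

def ingest_junit(db: sqlite3.Connection, junit_xml: str) -> None:
    # Record one row per testcase, keeping the failure message (if any).
    # Assumes a `results` table with `test_name` and `message` columns exists.
    root = ET.fromstring(junit_xml)
    for case in root.iter('testcase'):
        failure = case.find('failure')
        db.execute(
            'INSERT INTO results (test_name, message) VALUES (?, ?)',
            (case.get('name'),
             None if failure is None else failure.get('message')))
    db.commit()
```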
# Diagnosing fragile tests
One of the challenges in fixing fragile tests is characterising how they fail.
Frequently, fragile tests have several failure modes. Knowing the differences
and similarities between these modes can be remarkably helpful in localizing
the root cause of the failure. Moreover, it's not uncommon for failure modes to
be shared by multiple fragile tests. Unfortunately, identifying and correlating
these modes can be quite time-consuming, especially with infrequently-failing
tests.
To aid in this we extended the testsuite driver's JUnit output to include
failing test output. This is then persisted in the test tracking database
described above for later reference. This information has already proven to be
incredibly useful.
# An unexpected clean-up
One unexpected source of improvement in the reliability of our testsuite
arose from the recent
[addition](https://gitlab.haskell.org/ghc/ghc/merge_requests/1159) of an Alpine
Linux CI target (#14502). Alpine, unlike the other Linux platforms that we test
on, uses the `musl` C runtime implementation. Due to differences in `musl`'s
file flushing semantics, testing on Alpine immediately shed light on a few
subtly fragile testcases which we had previously noticed but had not yet
investigated.
This was a helpful reminder that writing and testing software for portability
is not only good for users but it can also improve correctness by shedding
light on subtly-flawed assumptions.
# Closing
In this post we discussed a few of the measures we have taken to improve and
maintain GHC's correctness while lowering maintenance overhead. By tackling
long-standing technical debt and putting in place tools to correlate test
failures across testsuite runs, testcases, and time, we have gained
significantly deeper insight into when, where, and how our tests are failing.
We look forward to sharing similar steps being taken on the compiler
performance front in a future post.