Runtime performance regression with binary operations on vectors

Let me briefly explain our use case: we have a machine learning tools created in Haskell. Basically it builds association rules for a database. In plain English: predicates on data rows. We need to score them, so we need to check how many rows are "matched" by the predicate.

In order to optimize performance, our code uses two main representations for data: one is a "human readable", where a values are real values. The second one is binarized representation for categorical data. The latter has is actually a family of representation, since we pack bits into tuples of Word64 and use bitwise operation to implement logic.

Simplified but representative code is attached.

In GHC 8.0.2 binary representation is faster by one order of magnitude than the "human readable" one:

➜  ghc-bug stack exec performance-bug-pair-1
"Generated"
benchmarking 256 columns/raw unbox vectors
time                 342.6 μs   (338.3 μs .. 348.0 μs)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 339.3 μs   (337.5 μs .. 342.0 μs)
std dev              7.203 μs   (5.596 μs .. 10.07 μs)
variance introduced by outliers: 13% (moderately inflated)

benchmarking 256 columns/binary packed
time                 32.07 μs   (29.69 μs .. 34.14 μs)
                     0.982 R²   (0.976 R² .. 0.997 R²)
mean                 29.97 μs   (29.41 μs .. 31.03 μs)
std dev              2.428 μs   (1.526 μs .. 3.750 μs)
variance introduced by outliers: 78% (severely inflated)

In GHC 8.2.2 (and later) binary representation performance is similar to "human readable", so no performance gain:

➜  ghc-bug stack exec performance-bug-pair-1
"Generated"
benchmarking 256 columns/raw unbox vectors
time                 442.4 μs   (406.7 μs .. 474.5 μs)
                     0.978 R²   (0.969 R² .. 0.993 R²)
mean                 399.3 μs   (391.3 μs .. 415.1 μs)
std dev              34.73 μs   (20.36 μs .. 53.29 μs)
variance introduced by outliers: 71% (severely inflated)

benchmarking 256 columns/binary packed
time                 378.6 μs   (295.8 μs .. 488.0 μs)
                     0.637 R²   (0.492 R² .. 0.780 R²)
mean                 568.1 μs   (437.1 μs .. 747.6 μs)
std dev              308.7 μs   (233.6 μs .. 386.1 μs)
variance introduced by outliers: 98% (severely inflated)

However, for certain compilation paths, we still can get similar speedup as with GHC 8.0.2. In the above examples we used 4-tuple of Word64 as binary representation. In the following code we run two tests: one on just Word64 and one of 4-tuple of Word64. The difference is that we just add the Word64 case:

➜  ghc-bug stack exec performance-bug-pair-2
"Generated"
benchmarking 64 columns/raw unbox vectors
time                 337.6 μs   (336.1 μs .. 339.3 μs)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 349.6 μs   (344.4 μs .. 359.7 μs)
std dev              23.22 μs   (15.39 μs .. 38.22 μs)
variance introduced by outliers: 60% (severely inflated)

benchmarking 64 columns/binary packed
time                 21.66 μs   (21.53 μs .. 21.79 μs)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 21.68 μs   (21.53 μs .. 21.89 μs)
std dev              613.2 ns   (466.0 ns .. 806.0 ns)
variance introduced by outliers: 30% (moderately inflated)

benchmarking 256 columns/raw unbox vectors
time                 344.5 μs   (341.6 μs .. 348.2 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 345.1 μs   (342.5 μs .. 349.3 μs)
std dev              10.66 μs   (7.997 μs .. 16.34 μs)
variance introduced by outliers: 25% (moderately inflated)

benchmarking 256 columns/binary packed
time                 28.04 μs   (27.70 μs .. 28.46 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 28.05 μs   (27.85 μs .. 28.30 μs)
std dev              758.2 ns   (628.2 ns .. 907.6 ns)
variance introduced by outliers: 27% (moderately inflated)

I made a variant of code with simpler accumulating function (in our code we accumulate pair of integers, simplified accumulator work on one integer). GHC 8.2.2 in that case losses speed-up with 4-tuples:

➜  ghc-bug stack exec performance-bug-2
"Generated"
benchmarking 64 columns/raw unbox vectors
time                 333.8 μs   (333.0 μs .. 335.1 μs)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 333.6 μs   (332.4 μs .. 335.8 μs)
std dev              5.651 μs   (3.233 μs .. 9.582 μs)

benchmarking 64 columns/binary packed
time                 39.06 μs   (38.98 μs .. 39.14 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 38.94 μs   (38.83 μs .. 39.14 μs)
std dev              495.0 ns   (310.2 ns .. 782.1 ns)

benchmarking 256 columns/raw unbox vectors
time                 336.9 μs   (336.2 μs .. 337.9 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 337.5 μs   (336.2 μs .. 339.8 μs)
std dev              5.757 μs   (2.946 μs .. 8.979 μs)

benchmarking 256 columns/binary packed
time                 239.8 μs   (237.6 μs .. 243.0 μs)
                     0.998 R²   (0.996 R² .. 0.999 R²)
mean                 251.4 μs   (247.9 μs .. 259.8 μs)
std dev              11.50 μs   (5.662 μs .. 19.79 μs)
variance introduced by outliers: 37% (moderately inflated)

In GHC 8.0.2 we have speed-up in both cases.

What may be related: using random-fu to generate random numbers seems to produce broken code on GHC 8.2.2 (the performance-bug-rfu.hs source).

Short description of attached sources:

performance-bug-pair-2.hs: using two binary representations
performance-bug-pair-1.hs: using one binary representation (one case commented)
performance-bug-1.hs: using one binary representation with simplified accumulator
performance-bug-2.hs: using two binary representation with simplified accumulator
performance-bug-rfu.hs: using random-fu to generate data (optional)
stack-8.0.yaml: stack file for GHC-8.0.2
stack-8.2.yaml: stack file for GHC-8.2.2
stack-nightly.yaml: stack file for GHC-8.4
performance-bug.cabal: to be able to stack build everything.

Trac metadata

Trac field	Value
Version	8.2.2
Type	Bug
TypeOfFailure	OtherFailure
Priority	normal
Resolution	Unresolved
Component	Compiler
Test case
Differential revisions
BlockedBy
Related
Blocking
CC
Operating system
Architecture

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information