## Runtime performance regression with binary operations on vectors

Let me briefly explain our use case: we have a machine learning tools created in Haskell. Basically it builds association rules for a database. In plain English: predicates on data rows. We need to score them, so we need to check how many rows are "matched" by the predicate.

In order to optimize performance, our code uses two main representations for data: one is a "human readable", where a values are real values. The second one is binarized representation for categorical data. The latter has is actually a family of representation, since we pack bits into tuples of Word64 and use bitwise operation to implement logic.

Simplified but representative code is attached.

In GHC 8.0.2 binary representation is faster by one order of magnitude than the "human readable" one:

```
➜ ghc-bug stack exec performance-bug-pair-1
"Generated"
benchmarking 256 columns/raw unbox vectors
time 342.6 μs (338.3 μs .. 348.0 μs)
0.999 R² (0.998 R² .. 1.000 R²)
mean 339.3 μs (337.5 μs .. 342.0 μs)
std dev 7.203 μs (5.596 μs .. 10.07 μs)
variance introduced by outliers: 13% (moderately inflated)
benchmarking 256 columns/binary packed
time 32.07 μs (29.69 μs .. 34.14 μs)
0.982 R² (0.976 R² .. 0.997 R²)
mean 29.97 μs (29.41 μs .. 31.03 μs)
std dev 2.428 μs (1.526 μs .. 3.750 μs)
variance introduced by outliers: 78% (severely inflated)
```

In GHC 8.2.2 (and later) binary representation performance is similar to "human readable", so no performance gain:

```
➜ ghc-bug stack exec performance-bug-pair-1
"Generated"
benchmarking 256 columns/raw unbox vectors
time 442.4 μs (406.7 μs .. 474.5 μs)
0.978 R² (0.969 R² .. 0.993 R²)
mean 399.3 μs (391.3 μs .. 415.1 μs)
std dev 34.73 μs (20.36 μs .. 53.29 μs)
variance introduced by outliers: 71% (severely inflated)
benchmarking 256 columns/binary packed
time 378.6 μs (295.8 μs .. 488.0 μs)
0.637 R² (0.492 R² .. 0.780 R²)
mean 568.1 μs (437.1 μs .. 747.6 μs)
std dev 308.7 μs (233.6 μs .. 386.1 μs)
variance introduced by outliers: 98% (severely inflated)
```

However, for certain compilation paths, we still can get similar speedup as with GHC 8.0.2. In the above examples we used 4-tuple of Word64 as binary representation. In the following code we run two tests: one on just Word64 and one of 4-tuple of Word64. The difference is that we just add the Word64 case:

```
➜ ghc-bug stack exec performance-bug-pair-2
"Generated"
benchmarking 64 columns/raw unbox vectors
time 337.6 μs (336.1 μs .. 339.3 μs)
0.999 R² (0.998 R² .. 1.000 R²)
mean 349.6 μs (344.4 μs .. 359.7 μs)
std dev 23.22 μs (15.39 μs .. 38.22 μs)
variance introduced by outliers: 60% (severely inflated)
benchmarking 64 columns/binary packed
time 21.66 μs (21.53 μs .. 21.79 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 21.68 μs (21.53 μs .. 21.89 μs)
std dev 613.2 ns (466.0 ns .. 806.0 ns)
variance introduced by outliers: 30% (moderately inflated)
benchmarking 256 columns/raw unbox vectors
time 344.5 μs (341.6 μs .. 348.2 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 345.1 μs (342.5 μs .. 349.3 μs)
std dev 10.66 μs (7.997 μs .. 16.34 μs)
variance introduced by outliers: 25% (moderately inflated)
benchmarking 256 columns/binary packed
time 28.04 μs (27.70 μs .. 28.46 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 28.05 μs (27.85 μs .. 28.30 μs)
std dev 758.2 ns (628.2 ns .. 907.6 ns)
variance introduced by outliers: 27% (moderately inflated)
```

I made a variant of code with simpler accumulating function (in our code we accumulate pair of integers, simplified accumulator work on one integer). GHC 8.2.2 in that case losses speed-up with 4-tuples:

```
➜ ghc-bug stack exec performance-bug-2
"Generated"
benchmarking 64 columns/raw unbox vectors
time 333.8 μs (333.0 μs .. 335.1 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 333.6 μs (332.4 μs .. 335.8 μs)
std dev 5.651 μs (3.233 μs .. 9.582 μs)
benchmarking 64 columns/binary packed
time 39.06 μs (38.98 μs .. 39.14 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 38.94 μs (38.83 μs .. 39.14 μs)
std dev 495.0 ns (310.2 ns .. 782.1 ns)
benchmarking 256 columns/raw unbox vectors
time 336.9 μs (336.2 μs .. 337.9 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 337.5 μs (336.2 μs .. 339.8 μs)
std dev 5.757 μs (2.946 μs .. 8.979 μs)
benchmarking 256 columns/binary packed
time 239.8 μs (237.6 μs .. 243.0 μs)
0.998 R² (0.996 R² .. 0.999 R²)
mean 251.4 μs (247.9 μs .. 259.8 μs)
std dev 11.50 μs (5.662 μs .. 19.79 μs)
variance introduced by outliers: 37% (moderately inflated)
```

In GHC 8.0.2 we have speed-up in both cases.

What may be related: using random-fu to generate random numbers seems to produce broken code on GHC 8.2.2 (the `performance-bug-rfu.hs`

source).

Short description of attached sources:

- performance-bug-pair-2.hs: using two binary representations
- performance-bug-pair-1.hs: using one binary representation (one case commented)
- performance-bug-1.hs: using one binary representation with simplified accumulator
- performance-bug-2.hs: using two binary representation with simplified accumulator
- performance-bug-rfu.hs: using random-fu to generate data (optional)
- stack-8.0.yaml: stack file for GHC-8.0.2
- stack-8.2.yaml: stack file for GHC-8.2.2
- stack-nightly.yaml: stack file for GHC-8.4
- performance-bug.cabal: to be able to stack build everything.

## Trac metadata

Trac field | Value |
---|---|

Version | 8.2.2 |

Type | Bug |

TypeOfFailure | OtherFailure |

Priority | normal |

Resolution | Unresolved |

Component | Compiler |

Test case | |

Differential revisions | |

BlockedBy | |

Related | |

Blocking | |

CC | |

Operating system | |

Architecture |