Severe performance degradation on Linux
Summary
While trying to package and deploy computationally intensive code for serverless (AWS Lambda/GC Run), I stumbled on a severe performance degradation. The exact same code that ran in about 30 s (single thread) on my local machine (MacBook Pro) would time out at the serverless limit of 15 min.
I set up a Criterion benchmark and identified the function that was causing the performance degradation by comparing with the results in a Linux Docker container. Afterwards I also tested on an older MacBook with a native Linux installation and with different versions of GHC (8.6.5 and 8.8.3). All of the benchmarks under Linux showed the same heavy performance degradation in some specific parts of the code.
Deep Dive
Using GHC profiling and Criterion (also comparing native OSX against containerized Ubuntu), I managed to isolate the function that was suffering the performance degradation. The function, called misoDoubleOR, went from 370 μs to 12.7 ms; in other words, the Linux executable was 34x slower. Since this function is part of the main loop, it was causing the overall performance issue.
I started to benchmark other functions used by misoDoubleOR, but none of them suffered the same level of degradation when compiled on Linux. I was almost hopeless about finding the cause when I stumbled on issue #17881 and remembered that I was using eta reduction to memoize some repeated calculations.
After removing the eta reduction, the execution time on Linux dropped to a more acceptable level (about 2x slower than the original OSX timing, or about 3x slower than the OSX timing without eta reduction). The eta-reduced version was:
getMisoAngleEta :: Symm -> Quaternion -> Quaternion -> Double
getMisoAngleEta symm = let
  foo = getAbsShortOmega . getInFZ (getSymmOps symm)
  -- avoiding eta expansion of q1 and q2 to memoize
  in \q1 q2 -> foo (q2 -#- q1)
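To make the pattern easier to discuss without the arche types, here is a minimal sketch of the two shapes being compared. Everything in it is hypothetical: symmTable stands in for getSymmOps symm, and the distance computation stands in for getAbsShortOmega . getInFZ. It only illustrates the intended sharing; whether the optimizer keeps that sharing is exactly what this report is about, so it should not be read as a guarantee.

```haskell
module Main where

-- Hypothetical stand-in for `getSymmOps symm`: a table that depends only on
-- the first argument and is assumed to be costly to build. Requires n > 0.
symmTable :: Int -> [Double]
symmTable n = [ fromIntegral k * pi / fromIntegral n | k <- [0 .. n - 1] ]

-- Eta-reduced shape (like getMisoAngleEta): the table is bound outside the
-- lambda, so a partial application `minDistShared n` is intended to build it
-- once and reuse it for every (a, b) pair.
minDistShared :: Int -> Double -> Double -> Double
minDistShared n =
  let table = symmTable n
  in \a b -> minimum [ abs (b - a - t) | t <- table ]

-- Eta-expanded shape: the table is bound under all three arguments, so as
-- written it is rebuilt on every call.
minDistUnshared :: Int -> Double -> Double -> Double
minDistUnshared n a b =
  let table = symmTable n
  in minimum [ abs (b - a - t) | t <- table ]

main :: IO ()
main = do
  let f     = minDistShared 24      -- partial application, table shared
      g     = minDistUnshared 24    -- table rebuilt per call
      pairs = [ (x, x + 0.3) | x <- [0, 0.1 .. 1] ]
  -- Both variants compute the same values; only the evaluation cost differs.
  print (sum [ f a b | (a, b) <- pairs ])
  print (sum [ g a b | (a, b) <- pairs ])
```

Both functions return identical results, so swapping one for the other is a pure performance change, which is what the with-eta and no-eta benchmarks below measure.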
>> Darwin Kernel Version 19.2.0: root:xnu-6153.61.1~20/RELEASE_X86_64 x86_64
>> GHC 8.6.5 (stack lts-14.27)
arche > benchmarks
Running 1 benchmarks...
Benchmark arche-bench: RUNNING...
benchmarking reference/fib
time 74.88 ms (71.96 ms .. 76.36 ms)
0.998 R² (0.992 R² .. 1.000 R²)
mean 77.33 ms (75.98 ms .. 78.96 ms)
std dev 2.527 ms (1.844 ms .. 3.340 ms)
benchmarking reference/malloc
time 7.942 ms (7.229 ms .. 8.772 ms)
0.935 R² (0.899 R² .. 0.971 R²)
mean 6.907 ms (6.640 ms .. 7.335 ms)
std dev 1.018 ms (711.4 μs .. 1.415 ms)
variance introduced by outliers: 76% (severely inflated)
benchmarking misoDoubleOR/with-eta
time 370.6 μs (357.9 μs .. 385.5 μs)
0.984 R² (0.971 R² .. 0.994 R²)
mean 383.9 μs (369.8 μs .. 402.7 μs)
std dev 53.30 μs (40.16 μs .. 71.08 μs)
variance introduced by outliers: 87% (severely inflated)
benchmarking misoDoubleOR/no-eta
time 239.2 μs (230.5 μs .. 247.6 μs)
0.977 R² (0.954 R² .. 0.989 R²)
mean 261.7 μs (249.9 μs .. 281.7 μs)
std dev 54.43 μs (38.19 μs .. 79.35 μs)
variance introduced by outliers: 95% (severely inflated)
>> Ubuntu 18.04 (Docker)
>> GHC 8.6.5 (stack lts-14.27)
Running 1 benchmarks...
Benchmark arche-bench: RUNNING...
benchmarking reference/fib
time 144.0 ms (141.2 ms .. 146.9 ms)
0.999 R² (0.997 R² .. 1.000 R²)
mean 148.9 ms (146.0 ms .. 156.4 ms)
std dev 6.589 ms (1.772 ms .. 9.655 ms)
variance introduced by outliers: 12% (moderately inflated)
benchmarking reference/malloc
time 7.787 ms (7.690 ms .. 7.964 ms)
0.998 R² (0.995 R² .. 1.000 R²)
mean 7.833 ms (7.780 ms .. 7.919 ms)
std dev 183.9 μs (120.9 μs .. 278.8 μs)
benchmarking misoDoubleOR/with-eta
time 12.73 ms (12.56 ms .. 12.98 ms)
0.998 R² (0.995 R² .. 1.000 R²)
mean 12.78 ms (12.71 ms .. 12.92 ms)
std dev 272.3 μs (154.4 μs .. 449.2 μs)
benchmarking misoDoubleOR/no-eta
time 739.8 μs (735.3 μs .. 746.2 μs)
0.999 R² (0.998 R² .. 1.000 R²)
mean 744.4 μs (739.7 μs .. 757.3 μs)
std dev 23.24 μs (5.377 μs .. 48.14 μs)
variance introduced by outliers: 21% (moderately inflated)
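For convenience, the slowdown factors can be recomputed from the Criterion "time" estimates in the two runs above. This is only a small arithmetic sketch; the Timing record and its field names are made up for the illustration, and the numbers are copied from the output above.

```haskell
module Main where

-- One benchmark with its Criterion "time" estimate on each platform, in seconds.
data Timing = Timing
  { label :: String
  , macOS :: Double  -- Darwin run, GHC 8.6.5
  , linux :: Double  -- Ubuntu 18.04 (Docker) run, GHC 8.6.5
  }

timings :: [Timing]
timings =
  [ Timing "reference/fib"         74.88e-3 144.0e-3
  , Timing "reference/malloc"      7.942e-3 7.787e-3
  , Timing "misoDoubleOR/with-eta" 370.6e-6 12.73e-3
  , Timing "misoDoubleOR/no-eta"   239.2e-6 739.8e-6
  ]

-- How many times slower the Linux run is than the macOS run.
slowdown :: Timing -> Double
slowdown t = linux t / macOS t

main :: IO ()
main = mapM_ report timings
  where
    report t = putStrLn (label t ++ ": " ++ show (slowdown t) ++ "x")
```

This gives roughly 1.9x for reference/fib, about 1.0x for reference/malloc, about 34x for misoDoubleOR/with-eta and about 3.1x for misoDoubleOR/no-eta. The with-eta case is therefore far outside the difference seen in the reference benchmarks.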
Steps to reproduce
I couldn't isolate the behavior in a simplified example, but I have put these benchmarks on a branch.
git clone git@github.com:arche-tool/arche.git
cd arche
git submodule update --init
stack install
To run the benchmark inside a container:
make run-bench
Expected behavior
I would expect:
- Similar runtimes across different OSs
- Possibility of using eta reduction to memoize values without performance penalty
Environment
- GHC version used: 8.6.5 and 8.8.3
- Operating System: OSX and Ubuntu