Performance with -O0 is much better than with the default or -O2; runghc performs best
In this particular case, -O2 or the default is 2x slower than -O0, and -O0 is in turn 2x slower than runghc. The README in this GitHub repo has instructions to reproduce the issue: https://github.com/harendra-kumar/ghc-perf
The issue seems to occur when part of the code is placed in a separate module. When all the code is in the same module, the problem does not occur: in that case -O2 or the default is faster than -O0. When the code is split into two modules, however, the performance is inverted.
Also, it does not always occur: when I tried to simplify the code for a smaller repro, the problem went away.
Edited by harendra