Unboxing-Related Performance Issue with Polymorphic Functions

My team has observed a 2x performance degradation in code that makes use of StateT that appears to be related to strictness and unboxing, even when built with -O2. Our code makes heavy use of the state monad, and when GHC fails to optimise this usage, the performance becomes untenably slow.

We've managed to minimise the behaviour to the following reproducer (which ignores state entirely), consisting of two files. It depends on the criterion package.

Lib.hs:

{-# LANGUAGE BangPatterns #-}

module Lib where

-- A type to take the place of state
data X a = X { runX :: !a }

test1 :: Int -> Int
test1 = \(!i) -> go i where
    go = \(!i) -> if i > 0
        then go $! i - 1
        else i
{-# NOINLINE test1 #-}

test2 :: Int -> Int
test2 = \(!i) -> runX (go i) where
    go = \(!i) -> if i > 0
        then go $! i - 1
        else X i
{-# NOINLINE test2 #-}

Main.hs:

{-# LANGUAGE Strict #-}
module Main where

import Lib
import Criterion
import Criterion.Main

main :: IO ()
main = defaultMain
    [ bgroup "main"
        [ bgroup "small"
            [ bench "without state" $ whnf test1 100000000
            , bench "with state"    $ whnf test2 100000000
            ]
        ]
    ]

Run as above, test2 takes twice as long to execute as test1. However, when the field of X is declared as runX :: !Int instead of runX :: !a, test2 exhibits identical performance to test1.
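For reference, the monomorphic variant mentioned above could look roughly like the following. This is a minimal sketch, not the code from the attached project: it assumes the type parameter is dropped entirely, and the names XInt, runXInt and test2Mono are hypothetical, chosen only so that the variant can sit alongside the original definitions.

```haskell
{-# LANGUAGE BangPatterns #-}

-- Hypothetical monomorphic counterpart of X: the strict field is fixed to Int.
-- (XInt, runXInt and test2Mono are illustrative names, not from the attachment.)
data XInt = XInt { runXInt :: !Int }

-- Same loop as test2, but the result is wrapped in the monomorphic XInt.
test2Mono :: Int -> Int
test2Mono = \(!i) -> runXInt (go i) where
    go = \(!i) -> if i > 0
        then go $! i - 1   -- count down strictly until i reaches 0
        else XInt i
{-# NOINLINE test2Mono #-}
```

Adding a third bench entry for test2Mono next to the two existing ones in Main.hs would be one way to compare all three variants in a single run.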

The issue has been reproduced across multiple linux64 machines, but has not been tested on any other architecture or operating system.

Please find the full (stack - yes, I know) project as an attachment. You can simply stack run to observe the issue. If you have any further queries please let me know.

Edited Mar 10, 2019 by Ara Adkins
