Optimizer misses unboxing opportunity

In my studying of the fannkuch benchmark, I've discovered (I think) another missed optimization. A scaled down illustration goes as follows:

{-# LANGUAGE TypeOperators, BangPatterns #-}

module Main (main) where

import Control.Monad.ST

import System.Environment

data (:*:) a b = !a :*: !b

whileLoop :: Int -> ST s Int
whileLoop = go 0
 where
 go !n k
   | k == 0    = return n
   | otherwise = go (n+1) (k-1)
{-# INLINE whileLoop #-}

iter :: Int -> Int -> ST s (Bool :*: Int)
iter k n = do
  k' <- whileLoop 40 >>= \k' -> return $! max k k'
  b <- return (n == 0)
  
  return $! b :*: k'
{-# INLINE iter #-}

mainLoop :: Int -> Int -> ST s Int
mainLoop k n = do
  done :*: k' <- iter k n
  
  if done
    then return k'
    else mainLoop k' (n - 1)

main = print =<< stToIO . mainLoop 0 . read . head =<< getArgs

If we look at the core for whileLoop's worker, we see:

$wpoly_go_r1aE :: forall s_aem.
                  Int# -> Int# -> STRep s_aem Int

$wpoly_go_r1aE =
  \ (@ s_aem)
    (ww_s18Y :: Int#)
    (ww1_s192 :: Int#)
    (eta_s19w :: State# s_aem) ->
    case ww1_s192 of wild_XF {
      __DEFAULT ->
        $wpoly_go_r1aE @ s_aem (+# ww_s18Y 1) (-# wild_XF 1) eta_s19w;
      0 -> (# eta_s19w, I# ww_s18Y #)
    }

Note, the return type is a boxed Int. The function is only used once, like so:

    ...
    case $wpoly_go_r1aE @ s_aem 0 40 w_s19f of wild_aFw { (# new_s_aFB, r_aFC #) ->
    case r_aFC of wild1_aG9 { I# y_aGb ->
    case <=# ww_s199 y_aGb of wild2_aGd {
      False ->
    ...

In other words, go boxes its results at the end of the loop, and the function that uses it immediately looks inside the box for the value. In this particular micro-benchmark, the boxed value (wild1_aG9 above) is actually used in the case where mainLoop returns the boxed value. However, in the larger benchmark I pulled this from, that is not the case in several areas (only the unboxed value is used, but it still goes through a box). Either way, the boxed value is only used at the end of the loop (and not every time there), and on every other iteration, this results in superfluous allocation.

I'll attach a manually unboxed version I wrote (it also has iter and max manually inlined to mainLoop, since that makes it easier to write; the core for the above shows that they are getting inlined properly, so I assume that isn't the issue). Using +RTS -sstderr shows that (running 100 million iterations here), the manually unboxed version allocates 50 kilobytes on the heap, and runs in around 15 seconds, whereas the version above that doesn't get unboxed does 1.6 gigabytes of heap allocation, and takes 18 seconds (in the larger benchmark, such extra boxing would happen perhaps 40 million times over the course of the program).

Thanks for your help.

Trac metadata

Trac field	Value
Version	6.8.2
Type	Bug
TypeOfFailure	OtherFailure
Priority	normal
Resolution	Unresolved
Component	Compiler
Test case
Differential revisions
BlockedBy
Related
Blocking
CC
Operating system
Architecture

Edited Apr 01, 2019 by Ben Gamari

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information