Enum instance for IntX / WordX are inefficient

The following factorial benchmark show high discrepancy in performance between Int and Int64 (and similarly between Word and Word64) when using product [1..n] to compute factorial:

{-# LANGUAGE TypeApplications #-}
import Criterion.Main

import Data.Int
import Data.Word
import Numeric.Natural

fact :: Integral t => t -> t
fact n = product [1..n]

fact2 :: Integral t => t -> t
fact2 n = go 1 n
  where go acc 1 = acc
        go acc n = go (acc * n) (n - 1)

main :: IO ()
main = defaultMain [
  bgroup "fact 20" [
     bench "Int"  $ whnf (fact @Int) 20
     , bench "Word"  $ whnf (fact @Word) 20
     , bench "Int64"  $ whnf (fact @Int64) 20
     , bench "Word64"  $ whnf (fact @Word64) 20
     , bench "Integer"  $ whnf (fact @Integer) 20
     , bench "Natural"  $ whnf (fact @Natural) 20
               ]
  ,
  bgroup "fact2 20" [
     bench "Int"  $ whnf (fact @Int) 20
     , bench "Word"  $ whnf (fact @Word) 20
     , bench "Int64"  $ whnf (fact @Int64) 20
     , bench "Word64"  $ whnf (fact @Word64) 20
     , bench "Integer"  $ whnf (fact @Integer) 20
     , bench "Natural"  $ whnf (fact @Natural) 20
               ]
  ]

benchmarking fact 20/Int
time                 17.33 ns   (17.22 ns .. 17.41 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 17.21 ns   (17.16 ns .. 17.29 ns)
std dev              222.4 ps   (184.5 ps .. 282.3 ps)
variance introduced by outliers: 15% (moderately inflated)

benchmarking fact 20/Word
time                 17.25 ns   (17.01 ns .. 17.63 ns)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 17.19 ns   (17.11 ns .. 17.36 ns)
std dev              358.6 ps   (199.4 ps .. 674.3 ps)
variance introduced by outliers: 32% (moderately inflated)

benchmarking fact 20/Int64
time                 166.7 ns   (165.4 ns .. 168.5 ns)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 166.8 ns   (166.0 ns .. 168.5 ns)
std dev              4.077 ns   (1.854 ns .. 6.592 ns)
variance introduced by outliers: 35% (moderately inflated)

benchmarking fact 20/Word64
time                 560.5 ns   (550.6 ns .. 575.0 ns)
                     0.995 R²   (0.989 R² .. 1.000 R²)
mean                 556.5 ns   (549.1 ns .. 575.9 ns)
std dev              36.08 ns   (9.195 ns .. 60.69 ns)
variance introduced by outliers: 78% (severely inflated)

benchmarking fact 20/Integer
time                 421.9 ns   (419.8 ns .. 425.3 ns)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 427.7 ns   (425.5 ns .. 431.3 ns)
std dev              9.238 ns   (6.311 ns .. 16.19 ns)
variance introduced by outliers: 28% (moderately inflated)

benchmarking fact 20/Natural
time                 534.5 ns   (532.6 ns .. 536.3 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 533.7 ns   (532.6 ns .. 535.3 ns)
std dev              4.535 ns   (3.652 ns .. 6.226 ns)

benchmarking fact2 20/Int
time                 17.14 ns   (17.07 ns .. 17.23 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 17.15 ns   (17.10 ns .. 17.24 ns)
std dev              227.1 ps   (167.4 ps .. 286.9 ps)
variance introduced by outliers: 16% (moderately inflated)

benchmarking fact2 20/Word
time                 16.93 ns   (16.86 ns .. 17.02 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 16.90 ns   (16.86 ns .. 16.95 ns)
std dev              149.8 ps   (107.4 ps .. 192.5 ps)

benchmarking fact2 20/Int64
time                 165.1 ns   (164.6 ns .. 165.7 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 164.8 ns   (164.5 ns .. 165.1 ns)
std dev              1.033 ns   (778.6 ps .. 1.363 ns)

benchmarking fact2 20/Word64
time                 545.3 ns   (542.8 ns .. 547.9 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 543.6 ns   (542.4 ns .. 545.2 ns)
std dev              4.741 ns   (3.736 ns .. 5.824 ns)

benchmarking fact2 20/Integer
time                 419.1 ns   (417.2 ns .. 421.1 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 418.2 ns   (416.8 ns .. 420.2 ns)
std dev              5.415 ns   (4.094 ns .. 7.147 ns)
variance introduced by outliers: 12% (moderately inflated)

benchmarking fact2 20/Natural
time                 533.8 ns   (532.2 ns .. 536.1 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 533.2 ns   (532.1 ns .. 534.8 ns)
std dev              4.428 ns   (3.217 ns .. 5.741 ns)

In fact, Int and Word are efficient, Int64 and Word64 are respectively 10x and 20x slowers, when they should be identical on 64 bit platform. Word64 is even slower than Integer!

Replacing fact by a handwritten recursion, in fact2 gives something more coherent where all the 64 bits types have roughly the same efficiency.

I observed in the source code that Int and Int64 are not implemented as type synonym.

Using -ddump-rule-firings, I observed that a rule fold/build fired and removed the fold for the Int case, and not for the Int64, I don't understand much ;(

This is not dramatic because most people will use Int instead of Int64, but it is a surprising behavior. I also observed that smaller integral (such as Int8, Int16, Int32) behaves as badly as Int64) and this is much more important because there is no other default and more efficient type synonym.

Note that I'll be interested to provide a patch myself if mentored a bit. Especially, I'm thinking about merging the Int and Int64 implementation (with respect to the architecture default word size).

Tested on GHC 8.4.2

Trac metadata

Trac field	Value
Version	8.4.2
Type	Bug
TypeOfFailure	OtherFailure
Priority	normal
Resolution	Unresolved
Component	Core Libraries
Test case
Differential revisions
BlockedBy
Related
Blocking
CC
Operating system
Architecture

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information