GHC generates poor code for repeated uses of min/max
Consider the following module, which intends to implement a branchless ray-AABB intersection test:
module SimpleGeom where
data Vec3 = Vec3
{-# UNPACK #-} !Double
{-# UNPACK #-} !Double
{-# UNPACK #-} !Double
data Ray = Ray !Vec3 !Vec3 !Vec3
data AABB = AABB !Vec3 !Vec3
testRayAABBIntersection :: Ray -> AABB -> Bool
testRayAABBIntersection (Ray (Vec3 ox oy oz) _ (Vec3 invDx invDy invDz))
(AABB (Vec3 minX minY minZ) (Vec3 maxX maxY maxZ)) =
let tx1 = (minX - ox) * invDx
tx2 = (maxX - ox) * invDx
ty1 = (minY - oy) * invDy
ty2 = (maxY - oy) * invDy
tz1 = (minZ - oz) * invDz
tz2 = (maxZ - oz) * invDz
tmin = min tx1 tx2 `max` min ty1 ty2 `max` min tz1 tz2
tmax = max tx1 tx2 `min` max ty1 ty2 `min` max tz1 tz2
in tmax >= max 0 tmin
Everything is strict primitive operations, so GHC should generate very simple, fast code, right? But upon compiling with ghc -O -ddump-simpl -ddump-to-file SimpleGeom, I found a mess of nested local lambdas and similar performance-killing expression forms. (See the attached output file.)
There are two possible issues I can see here.
- GHC is trying to expand out all of the branches recursively (I would presume via case-of-cases transformation), which is a bad idea in this instance compared to just performing the cases sequentially and storing their results.
- GHC is generating branches for floating-point min/max. Instruction sets like SSE2 include non-branching floating-point min/max instructions, which is exactly what this algorithm was designed to exploit, but GHC does not generate code that could take advantage of these instructions.
Trac metadata
| Trac field | Value |
|---|---|
| Version | 7.8.2 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | Windows |
| Architecture |