Demand Analysis scales quadratically in the size of the program
While investigating #14816, I realised that Demand analysis takes quadratic time and memory for the following kind of program:
```haskell
{-# LANGUAGE NoImplicitPrelude #-}
module GHC.Ix where

import GHC.Base

class (Ord a) => Ix a where
  range :: (a,a) -> [a]

instance (Ix a1, Ix a2, Ix a3, Ix a4, Ix a5, Ix a6, Ix a7, Ix a8, Ix a9,
          Ix aA, Ix aB, Ix aC, Ix aD, Ix aE, Ix aF) =>
         Ix (a1,a2,a3,a4,a5,a6,a7,a8,a9,aA,aB,aC,aD,aE,aF) where
  range ((l1,l2,l3,l4,l5,l6,l7,l8,l9,lA,lB,lC,lD,lE,lF),
         (u1,u2,u3,u4,u5,u6,u7,u8,u9,uA,uB,uC,uD,uE,uF)) =
    [(i1,i2,i3,i4,i5,i6,i7,i8,i9,iA,iB,iC,iD,iE,iF) | i1 <- range (l1,u1),
                                                      i2 <- range (l2,u2),
                                                      i3 <- range (l3,u3),
                                                      i4 <- range (l4,u4),
                                                      i5 <- range (l5,u5),
                                                      i6 <- range (l6,u6),
                                                      i7 <- range (l7,u7),
                                                      i8 <- range (l8,u8),
                                                      i9 <- range (l9,u9),
                                                      iA <- range (lA,uA),
                                                      iB <- range (lB,uB),
                                                      iC <- range (lC,uC),
                                                      iD <- range (lD,uD),
                                                      iE <- range (lE,uE),
                                                      iF <- range (lF,uF)]
```
Fortunately, this is already the largest instance in `GHC.Ix`. But these are the numbers for the 13-, 14-, 15- and 16-tuple instances, respectively:
| n  | DmdAnal alloc (MB) |
|----|--------------------|
| 13 | 46.8 |
| 14 | 52.5 |
| 15 | 58.5 |
| 16 | 65.1 |
Here's a quadratic regression: https://www.desmos.com/calculator/jxnxfp6hte
As you can see, if we had an instance for a 100-tuple, we'd see over 2GB of allocation. OK, still not that bad, but diagnosing and possibly fixing the issue might lead to improved DmdAnal performance for large programs.
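As a rough sanity check on that extrapolation (my own arithmetic, not the Desmos fit, which regresses over all four points): fitting a quadratic through three of the measured points with Newton's divided differences already lands well above 2000 MB at n = 100. The function name `dmdAnalAllocMB` is made up for illustration.

```haskell
-- Quadratic through the measured points (13, 46.8), (15, 58.5), (16, 65.1)
-- via Newton's divided differences; a rough stand-in for the Desmos fit.
dmdAnalAllocMB :: Double -> Double
dmdAnalAllocMB n = 46.8 + d1 * (n - 13) + d2 * (n - 13) * (n - 15)
  where
    d1 = (58.5 - 46.8) / (15 - 13)                      -- first divided difference
    d2 = ((65.1 - 58.5) / (16 - 15) - d1) / (16 - 13)   -- second divided difference
```

`dmdAnalAllocMB 100` comes out around 2.4 GB, consistent with the "over 2GB" estimate above.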
Unfortunately, I think the quadratic behavior is quite fundamental due to the nature of fixed-point iteration. For a triple, I see generated code like
```
$crange_aKj
  = \ @a1_aK8 @a2_aK9 @a3_aKa $dIx_aKb $dIx_aKc $dIx_aKd ds_dN4 ->
      case ds_dN4 of { (ds_dNg, ds_dNh) ->
      case ds_dNg of { (l1_aji, l2_ajj, l3_ajk) ->
      case ds_dNh of { (u1_ajl, u2_ajm, u3_ajn) ->
      let { lvl_sP4 = range $dIx_aKc (l2_ajj, u2_ajm) } in
      let { lvl_sP3 = range $dIx_aKd (l3_ajk, u3_ajn) } in
      letrec {
        go1_aNT
          = \ ds_aNU ->
              case ds_aNU of {
                [] -> [];
                : y_aNX ys_aNY ->
                  let { z_X4 = go1_aNT ys_aNY } in
                  letrec {
                    go1_X5
                      = \ ds_X6 ->
                          case ds_X6 of {
                            [] -> z_X4;
                            : y_X8 ys_X9 ->
                              let { z_Xa = go1_X5 ys_X9 } in
                              letrec {
                                go1_Xb
                                  = \ ds_Xc ->
                                      case ds_Xc of {
                                        [] -> z_Xa;
                                        : y_Xe ys_Xf -> : (y_aNX, y_X8, y_Xe) (go1_Xb ys_Xf)
                                      }; } in
                              go1_Xb lvl_sP3
                          }; } in
                  go1_X5 lvl_sP4
              }; } in
      go1_aNT (range $dIx_aKb (l1_aji, u1_ajl))
      }
      }
      }
```
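For reference, here's a hand-written source-level rendering of that Core (the name `range3` is mine, not GHC's): each comprehension generator becomes a nested `go` loop, and each loop captures the rest of its *enclosing* loop as a thunk, playing the role of `z_X4`/`z_Xa` above.

```haskell
-- Source-level sketch of the Core above, for
-- [(i1,i2,i3) | i1 <- xs1, i2 <- xs2, i3 <- xs3].
range3 :: [a] -> [b] -> [c] -> [(a, b, c)]
range3 xs1 xs2 xs3 = go1 xs1
  where
    go1 []       = []
    go1 (y1:ys1) = go2 xs2
      where
        z1 = go1 ys1                 -- thunk: rest of the outermost loop
        go2 []       = z1
        go2 (y2:ys2) = go3 xs3
          where
            z2 = go2 ys2             -- thunk: rest of the middle loop
            go3 []       = z2
            go3 (y3:ys3) = (y1, y2, y3) : go3 ys3
```

Note that `go3` recurses only into itself; its only references to the outer loops go through the thunks `z1`/`z2`.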
This is the "call stack" in which we'll analyse the RHSs of the `go1_*` bindings for signatures:

- `aNT` first
  - `X5` first
    - `Xb` first
    - `Xb` second: stable
  - `X5` second ..
    - `Xb`: still stable
  - .. `X5`: stable
- `aNT` second ...
  - `X5` ..
    - `Xb`: still stable
  - .. `X5`: still stable
- ... `aNT`: stable
The innermost binding `Xb` is analysed 4 times. In general, for an n-tuple, the innermost binding will be analysed n+1 times, the next outer binding n times, and so on.
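Summing those per-binding analysis counts (my own back-of-the-envelope arithmetic, not GHC code) gives (n² + 3n)/2 RHS analyses in total for the n nested loops:

```haskell
-- Total RHS analyses for n nested go loops: the innermost is analysed
-- (n+1) times, the next outer one n times, ..., the outermost twice.
totalAnalyses :: Int -> Int
totalAnalyses n = sum [2 .. n + 1]   -- closed form: (n^2 + 3*n) `div` 2
```

For the triple above this gives 9 analyses; for the 16-tuple instance, 152.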
Note that the "cunning plan" in `Note [Initialising Strictness]`, which reuses signatures between iterations, saves us from scaling exponentially here. Still, we scale quadratically.
While I doubt we can do better in general without conservative approximation, note that in this example, `Xb` doesn't actually call `X5` (or `aNT`)! It should not need to be re-analysed once it's stable. The recursive occurrence is in the thunks `z_*`, which are then referenced from the nested loops. We use LetUp for thunks, so there is no way that the RHS of `z_Xa` can influence the demand signature of `go1_Xb`. If we had FV annotations as suggested in #19538, we could save re-analysing `Xb` simply by knowing that the unstable outer bindings `aNT` and `X5` don't occur free in `Xb`.
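To make that idea concrete, here is a heavily simplified, hypothetical worklist fixpoint; every name in it (`fixpointFV`, the `sig` placeholder, the toy environment) is made up for illustration and is not DmdAnal's actual machinery. A binding is re-enqueued only when the signature of something that occurs free in its RHS changes, so a stable `Xb` that doesn't mention `aNT` or `X5` is never re-analysed:

```haskell
import qualified Data.Map.Strict as M
import qualified Data.Set as S

-- Hypothetical FV-directed fixpoint: re-analyse a binding only when one
-- of its free variables' signatures changed. For termination, `analyse`
-- must be monotone over a finite lattice of signatures.
fixpointFV
  :: (Ord k, Eq sig)
  => M.Map k (S.Set k)          -- free variables of each binding's RHS
  -> (M.Map k sig -> k -> sig)  -- analyse one RHS under an environment
  -> M.Map k sig                -- initial (optimistic) signatures
  -> M.Map k sig
fixpointFV fvs analyse = go (M.keys fvs)
  where
    go []       env = env
    go (k:work) env
      | new == env M.! k = go work env                       -- k is stable
      | otherwise        = go (work ++ deps) (M.insert k new env)
      where
        new  = analyse env k
        -- only bindings that mention k free can be destabilised by k
        deps = [ d | (d, fv) <- M.toList fvs, k `S.member` fv ]
```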
I'm a bit doubtful that we can make that plan happen easily. Ultimately, what we want is a finer-grained dependency graph for fixed-point iteration, like the one I used in my master's thesis: https://pp.ipd.kit.edu/uploads/publikationen/graf17masterarbeit.pdf#section.4.8, but that approach was VERY complicated and unfortunately has huge constants.