Simon-nofib-notes 7.59 KB
Newer Older
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
	--------------------------------------------
		Notes on nofib programs
	--------------------------------------------

Every time I do performance tuning on GHC I re-discover wrinkles in
particular nofib programs.   I intend to record these wrinkles herein.


NOTE: With 4.09 we introduced Unicode.  That means that (C# x) always allocates,
whereas it didn't before.  So allocations go up a bit.

---------------------------------------
	Imaginary suite
---------------------------------------


gen_regexps
~~~~~~~~~~~
I found that there were some very bad loss-of-arity cases in PrelShow.  
  In particular, we had:

	showl ""       = showChar '"' s
	showl ('"':xs) = showString "\\\"" . showl xs
	showl (x:xs)   = showLitChar x . showl xs

  Trouble is, we get
	showl = \xs -> case xs of
			  ...
			  (x:xs) -> let f = showLitChar x
					g = showl xs
				    in \s -> f (g x)
  which is TERRIBLE.  We can't spot that showLitChar has arity 2 because
  it looks like this:

	...other eqns...
        showLitChar c = showString ('\\' : asciiTab!!ord c)

  notice that the (asciiTab!!orc c) is outside the \s, so GHC can't rewrite it to

	showLitChar c =  \s -> showString ('\\' : asciiTab!!ord c) s

  So I've changed PrelShow.showLitChar to use explicit \s.  Even then, showl
  doesn't work, because GHC can't see that showl xs can be pushed inside the \s.
  So I've put an explict \s there too.  

	showl ""       s = showChar '"' s
	showl ('"':xs) s = showString "\\\"" (showl xs s)
	showl (x:xs)   s = showLitChar x (showl xs s)

  Net result: imaginary/gen_regexps more than halves in allocation!


53 54 55 56 57 58 59
x2n1
~~~~

Relies quite heavily on specialisations for Num (Complex Double).  If
this test suddenly gets worse by a factor of 2 or so, chances are that
specialisation is broken.

60 61 62 63
---------------------------------------
	Spectral suite
---------------------------------------

simonpj@microsoft.com's avatar
simonpj@microsoft.com committed
64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88
cse
~~~
The functions 'go' and 'go1', called from draw, get arity 1, but they
should get arity 2. We need a better arity analysis, a la Gill. 'go' looks
like this:
    go_r1og =
      \ (ds_aGI :: [GHC.Base.Char]) ->
        case ds_aGI of wild_aGJ {
          [] -> z_r1oe;
          : y_aGN ys_aGO ->
    	let {
    	  xs7_s1i8 :: GHC.Prim.Int# -> [GHC.Base.Char]
    	  [Str: DmdType]
    	  xs7_s1i8 = go_r1og ys_aGO
    	} in 
    	  \ (m_XWf :: GHC.Prim.Int#) ->
    	    case GHC.Prim.<=# m_XWf 1 of wild1_aSI {
    	      GHC.Base.False ->
    		GHC.Base.: @ GHC.Base.Char y_aGN (xs7_s1i8 (GHC.Prim.-# m_XWf 1));
    	      GHC.Base.True -> GHC.Base.: @ GHC.Base.Char y_aGN lvl13_r1n0
    	    }
        }
Notice the 'let' which stops the lambda moving out.


89 90 91 92 93 94 95
Eliza
~~~~~
In June 2002, GHC 5.04 emitted four successive
    NOTE: Simplifier still going after 4 iterations; bailing out.
messages.  I suspect that the simplifer is looping somehow.


96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125
Expert
~~~~~~
In spectral/expert/Search.ask there's a statically visible CSE. Catching this 
depends almost entirely on chance, which is a pity.


Fish
~~~~
The performance of fish depends crucially on inlining scale_vec2.
It turns out to be right on the edge of GHC's normal threshold size, so
small changes to the compiler can knock it to and fro.


Integer
~~~~~~~
A good benchmark for beating on big-integer arithmetic

Knights
~~~~~~~
In knights/KnightHeuristic, we don't find that possibleMoves is strict
(with important knock-on effects) unless we apply rules before floating
out the literal list [A,B,C...].
Similarly, in f_se (F_Cmp ...) in listcompr (but a smaller effect)


Lambda
~~~~~~
This program shows the cost of the non-eta-expanded lambdas that arise from
a state monad.  

126 127 128 129 130
Mandel
~~~~~~
Relies heavily on having a specialised version of Complex.magnitude
(:: Complex Double -> Double) available.

131 132 133 134 135 136
Mandel.mandelset has a go-loop inside an enumDeltaToIntegerFB, out of which
4.08.2 managed to float a constant expression, but HEAD did not.  I think
this is because the pre-let-floating simplification did too little inlining;
in particular, it did not inline windowToViewport


137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157
Multiplier
~~~~~~~~~~
In spectral/multiplier, we have 
    xor = lift21 forceBit f
      where f :: Bit -> Bit -> Bit
	    f 0 0 = 0
	    f 0 1 = 1
	    f 1 0 = 1
	    f 1 1 = 0
  Trouble is, f is CPR'd, and that means that instead of returning
  the constants I# 0, I# 1, it returns 0,1 and then boxes them.
  So allocation goes up.  I don't see a way around this.


Parstof
~~~~~~~
spectral/hartel/parstof ends up saying
	case (unpackCString "x") of { c:cs -> ... }
  quite a bit.   We should spot these and behave accordingly.


158 159 160 161 162
Power
~~~~~
With GHC 4.08, for some reason the arithmetic defaults to Double.  The
right thing is to default to Rational, which accounts for the big increase
in runtime after 4.08
163 164


165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185
Puzzle
~~~~~~
The main function is 'transfer'.  It has some complicated join points, and
I found that in my experimental proto 4.09 compiler I had 

	let ds = go xs in
	let $j = .... ds ... in
	case e of
	  p1 -> $j
	  p2 -> $j
	  p3 -> ds

But ds wasn't getting inlined because we couldn't spot that it was actually
only being used once.   Keith's analyser should find this!


Also, making concat into a good producer made a large gain.

My proto 4.09 still allocates more, partly because of more full laziness relative
to 4.08; I don't know why that happens

186 187 188
Extra allocation is happening in 5.02 as well; perhaps for the same reasons.  There is 
at least one instance of floating that prevents fusion; namely the enumerated lists
in 'transfer'.
189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245

Sphere
~~~~~~
A key function is vecsub, which looks like this (after w/w)

$wvecsub
  = \ ww :: Double  ww1 :: Double ww2 :: Double
      ww3 :: Double ww4 :: Double ww5 :: Double ->
	let { a = case ww of wild { D# x ->
	      	  case ww3 of wild1 { D# y ->
	      	  let { a1 = -## x y
	      	  } in  $wD# a1
	      	  } } } in
	let { a1 = case ww1 of wild { D# x ->
	      	   case ww4 of wild1 { D# y ->
	      	   let { a2 = -## x y
	      	   } in  $wD# a2
	      	   } } } in
	let { a2 = case ww2 of wild { D# x ->
	      	   case ww5 of wild1 { D# y ->
	      	   let { a3 = -## x y
	      	   } in  $wD# a3
	      	   } } 
	} in  (# a, a1, a2 #)

Currently it gets guidance: IF_ARGS 6 [2 2 2 2 2 2] 25 4
It occurs in a context like

	case $wvecsub a b c d e f of  (# p, q, r #) ->
	case p of		      D# p' ->
	case q of		      D# q' ->
        case r of		      D# r' ->

So it would be much, much better to inline it.  But the result-discounting
doesn't "see" that the three results are each taken apart, and that's the
problem.

One question is this: why did minusDouble get inlined? If it didn't then
$vecsub would look much smaller:

$wvecsub
  = \ ww :: Double  ww1 :: Double ww2 :: Double
      ww3 :: Double ww4 :: Double ww5 :: Double ->
	let { a  = minusDouble ww  ww3 } in
	let { a1 = minusDouble ww1 ww4 } in
	let { a2 = minusDouble ww2 ww5 } in
	(# a, a1, a2 #)

Reason minusDouble gets inlined: because we are trying to get update in place.
So I added a flag -funfolding-update-in-place to enable this "optimisation".
Omitting the flag gives much better inlining for $wvecsub at least.


Sphere also does 60,000 calls to hPutStr, so I/O plays a major role.  Currently
this I/O does a *lot* of allocation, much of it since the adddition of thread-safety.


246 247 248 249
Treejoin
~~~~~~~~
Does a lot of IO.readFile.

250 251 252 253 254 255
---------------------------------------
	Real suite
---------------------------------------


	
256 257 258 259 260 261 262 263
Maillist
~~~~~~~~

Uses appendFile repeatedly rather than opening the output file once,
which leads to numerous file opens/closes.  Allocations will rise with
the new I/O subsystem in 5.02 because the I/O buffer will be
re-allocated on the heap for each open, whereas previously it would be
allocated on the C heap and therefore not show up in the stats.