The active SPECIALIZE pragma should be applied, but isn't. This can be seen by comparing the core and runtimes of fcTest (slow) vs vtTest (fast). I need this version of the pragma in my real code as the phantom type m is reified, so I need to specialize the vector code without specifying the phantom type.
I can get fcTest to run fast if I use the commented-out SPECIALIZE pragma instead. However, that pragma seems very straightforward to me (all types are concrete). The docs indicate that GHC should automatically specialize in most cases, so why does it not specialize to the commented-out signature automatically?
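For readers without the attached files, here is a minimal sketch of the two pragma forms being compared. It is not the ticket's code: the list-backed VT, the index type M8, the identity-like g, and coeffs are hypothetical stand-ins, and the shape of FastCyc is assumed from the constructor names used in this thread.

```haskell
{-# LANGUAGE FlexibleContexts #-}
module Main where

-- Hypothetical stand-in for the ticket's vector wrapper; 'm' is a phantom index.
newtype VT m r = VT [r]

-- Hypothetical concrete index type, used only to pin down 'm' at the call site.
data M8

instance Num r => Num (VT m r) where
  VT a + VT b   = VT (zipWith (+) a b)
  VT a * VT b   = VT (zipWith (*) a b)
  abs (VT a)    = VT (map abs a)
  signum (VT a) = VT (map signum a)
  negate (VT a) = VT (map negate a)
  fromInteger n = VT (repeat (fromInteger n))

-- Assumed shape of the wrapper type: two representations of the same value.
data FastCyc t r = PowBasis (t r) | DecBasis (t r)

-- Assumed basis conversion; here it only relabels the constructor.
g :: FastCyc t r -> FastCyc t r
g (DecBasis v) = PowBasis v
g p            = p

plusFastCyc :: Num (t r) => FastCyc t r -> FastCyc t r -> FastCyc t r
plusFastCyc (PowBasis v1) (PowBasis v2) = PowBasis (v1 + v2)
plusFastCyc p1 p2                       = plusFastCyc (g p1) (g p2)

-- "Active" form: the phantom index 'm' stays polymorphic.
{-# SPECIALIZE plusFastCyc
      :: FastCyc (VT m) Int -> FastCyc (VT m) Int -> FastCyc (VT m) Int #-}

-- "Commented-out" form: every type is concrete.
-- {-# SPECIALIZE plusFastCyc
--       :: FastCyc (VT M8) Int -> FastCyc (VT M8) Int -> FastCyc (VT M8) Int #-}

-- Helper to read the coefficients back out.
coeffs :: FastCyc (VT m) r -> [r]
coeffs (PowBasis (VT xs)) = xs
coeffs (DecBasis (VT xs)) = xs

main :: IO ()
main = print (coeffs (plusFastCyc x y))  -- prints [11,22,33]
  where
    x = PowBasis (VT [1, 2, 3]) :: FastCyc (VT M8) Int
    y = DecBasis (VT [10, 20, 30])
```

A program this small is likely to be inlined wholesale, so it only shows the syntax of the two pragmas, not the slowdown itself.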
Where are runRand and liftRand defined? Hoogle can't find either. Google finds the MonadRandom package on Hackage, but I can't see liftRand there.
Would it be possible to rejig the example so that it doesn't use this extra package? Makes it much easier to reproduce. You don't, presumably, actually need random numbers to demonstrate the bug.
Thanks, that's helpful. I can now at least compile it.
One complicating factor is that you have a Num instance for FastCyc. You could simplify the setup by calling plusFastCyc in cyclotomicTimeTest, and defining
    plusFastCyc :: Num (t r) => FastCyc t r -> FastCyc t r -> FastCyc t r
    plusFastCyc (PowBasis v1) (PowBasis v2) = PowBasis $ v1 + v2
    plusFastCyc p1@(DecBasis _) p2@(PowBasis _) = (g p1) + p2
    plusFastCyc p1@(PowBasis _) p2@(DecBasis _) = p1 + (g p2)
    plusFastCyc p1 p2 = (g p1) + (g p2)
Now you can write your specialise pragmas (or not) for that. Do you get the same behaviour? That would eliminate one source of complexity.
Can you say in more detail what you expect to happen? The slow version (could we call it slow rather than cyclotomicTimeTest?) iterates plusFastCyc, which necessarily does a lot more work unpacking and packing those PowBasis constructors. Are you ok with that? But you aren't ok with something.
In short, it's a bit complicated for me to understand the problem.
Maybe you can show some -ddump-simpl core and say "this call here should be specialised, why doesn't the rule fire?"
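One way to collect that evidence, sketched under assumptions: the scale function below is a hypothetical stand-in, not code from the ticket. The OPTIONS_GHC line uses standard GHC debugging flags to write the simplified Core and the list of fired rewrite rules to files next to the source.

```haskell
{-# OPTIONS_GHC -O2 -ddump-simpl -ddump-rule-firings -dsuppress-all -ddump-to-file #-}
module Main where

-- Hypothetical overloaded function standing in for plusFastCyc.
scale :: Num a => a -> [a] -> [a]
scale k = map (k *)

-- Ask GHC for an Int-specific copy plus a rewrite rule that redirects
-- matching calls to it.
{-# SPECIALIZE scale :: Int -> [Int] -> [Int] #-}

main :: IO ()
main = print (scale 3 [1, 2, 3 :: Int])  -- prints [3,6,9]
```

After compiling, the rule-firings dump can be searched for an entry mentioning the specialisation of the function (GHC typically prefixes these rule names with SPEC), and the simplifier dump shows whether the call site refers to the specialised or the generic binding. In a toy program like this GHC may simply inline the call instead, so the absence of a firing here proves nothing on its own.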
Incidentally, if you want GHC to auto-specialise an imported function, to types that may not even be in scope in the defining module, you should mark that function as INLINABLE.
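As a rough illustration of that pragma (sumSquares is a hypothetical function, and everything is kept in one module only so that the sketch compiles on its own):

```haskell
module Main where

-- Hypothetical overloaded function. In a real project it would live in a
-- separate library module that knows nothing about the caller's types.
sumSquares :: Num a => [a] -> a
sumSquares = foldr (\x acc -> x * x + acc) 0
-- INLINABLE keeps the unfolding in the interface file, so a client module
-- compiled with optimisation can create its own specialised copy.
{-# INLINABLE sumSquares #-}

main :: IO ()
main = print (sumSquares [1, 2, 3 :: Int])  -- prints 14
```

The pragma only makes the definition available to importing modules; whether a specialised copy is actually created and called is exactly what the Core dumps have to confirm.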
I updated the files again per your suggestion, and get identical behavior.
I expect vtTest (currently fast) and fcTest (currently slow) to have about the same runtime, i.e. less than 2 seconds' difference. I do not want to inline plusFastCyc: the function is too large and used in many places, so inlining everywhere would create code bloat. Thus I want GHC to specialize the call to plusFastCyc rather than inlining it.
The two functions are doing the same work, but fcTest has one more level of indirection (one more wrapper on the type, and one more function call per addition).
As far as "doing a lot of work unpacking Pow constructors", plusFastCyc is iterated 100 times, but the runtime difference is 1 minute 18 seconds. I'm willing to pay for 100 function calls, but I think we can agree that something more is going on than just unpacking constructors.
I put some core snippets here http://lpaste.net/98593
This was compiled with -O3 using the forall'd SPECIALIZE pragma.
On line 10, you can see that GHC does write a specialized version of plusFastCyc, compared to the generic version on line 167.
The rule for the specialization is on line 225. I believe this rule should fire on line 270. (main6 calls iterate main8 y, so main8 is where plusFastCyc should be specialized.)
In regards to GHC auto-specializing: if I do not explicitly specialize plusFastCyc at all (but still mark it as INLINABLE), fcTest is slow. If instead I specialize plusFastCyc with concrete types as in the comment, fcTest is fast. Thus it appears GHC is *not* auto-specializing plusFastCyc, despite it being marked as INLINABLE.
Inlinable is known to prevent specialization from firing. (unless my recollection is wrong, has something to do with the order of the relevant passes in ghc afaik)
I tried removing the INLINABLEs, but that didn't help. I was under the impression that at least for auto-specialization across modules, INLINABLE is required. I've also tried playing around with phase control (just a little) to avoid inlining/specialization order problems, but I couldn't get that to work either.
If you write the SPECIALIZE pragma in the defining module for the operation (assuming the associated class instances specialize too; maybe check that they can specialize?), you don't need the INLINABLE (unless you want other instances to be specialized).
What happens when you go one step in the other direction and use INLINE?
For the small example above, GHC simply inlines everything and both main functions are equally fast.
However, I tried this in my real code and the equivalent of plusFastCyc is too large for GHC to inline (even with INLINE pragmas), and used in too many places for that to be a good solution even if it did work. That's why I'm trying to make GHC call a specialized function instead.
In real code, plusFastCyc will call a function in class Factored. The function in Factored and the call to it in plusFastCyc were removed because they weren't needed to demonstrate the specialization problem.
{-# SPECIALIZE plusFastCyc :: (FastCyc (VT U.Vector m) Int) -> (FastCyc (VT U.Vector m) Int) -> (FastCyc (VT U.Vector m) Int) #-}
isn't satisfactory?
I added some more code to demonstrate something more like a real use case for the program: http://lpaste.net/98840. It has the same issue as Foo.hs above, but hopefully you can see why the constraint is needed. I wanted to remove it from the example to show that the problem wasn't type families, constraint kinds, etc.