nofib/real/gg is miscompiled at -O1

I've noticed that nofib/real/gg fails with an output mismatch. After reducing the problem I arrived at the following small example (the trace expressions can also be removed):

{{{
module Main where

import Debug.Trace

main = do
  putStrLn $ printFloat 100

printFloat :: Float -> String
printFloat x = f (show (round (x*10)))
  where
    f "0" = trace "case 0" "0"
    f _   = trace "case _" $ show (round x)
}}}
Compiling it with the current HEAD gives:
{{{
  > ~/dev/ghc-clean/inplace/bin/ghc-stage2 -fforce-recomp -O0 Test.hs
  [1 of 1] Compiling Main             ( Test.hs, Test.o )
  Linking Test ...
  > ./Test
  case _
  100
  > ~/dev/ghc-clean/inplace/bin/ghc-stage2 -fforce-recomp -O1 Test.hs
  [1 of 1] Compiling Main             ( Test.hs, Test.o )
  Linking Test ...
  > ./Test
  case _
  1000
  > ~/dev/ghc-clean/inplace/bin/ghc-stage2 -fforce-recomp -O2 Test.hs
  [1 of 1] Compiling Main             ( Test.hs, Test.o )
  Linking Test ...
  > ./Test
  case _
  100
  }}}
Note that with ```-O1``` the output is 1000! Either the bug is in the old
codegen, or the new codegen simply does not trigger it:
{{{
  > ~/dev/ghc-clean/inplace/bin/ghc-stage2 -fforce-recomp -O1 -fnew-codegen Test.hs
  [1 of 1] Compiling Main             ( Test.hs, Test.o )
  Linking Test ...
  > ./Test
  case _
  100
}}}

I've also looked at the assembly, and the only difference I noticed is that two instructions are in a different order. With ```-O1```:

{{{
  Main.$wprintFloat_info:
    _c28u:
    leaq -40(%rbp),%rax
    cmpq %r15,%rax
    jb _c2aV
    mulss _n2aX(%rip),%xmm1    <--- !
    movss %xmm1,%xmm0          <--- !
    subq $8,%rsp
    movl $1,%eax
    call rintFloat
    addq $8,%rsp
    movss %xmm1,-8(%rbp)
    movss %xmm0,%xmm1
    movq $s23Y_info,-16(%rbp)
    addq $-16,%rbp
    jmp stg_decodeFloat_Int#
    _c2aV:
    movl $Main.$wprintFloat_closure,%ebx
    jmp *-8(%r13)
    .size Main.$wprintFloat_info, .-Main.$wprintFloat_info
    .section .rodata
    .align 8
    .align 4
  }}}
and with ```-O2```:
{{{
  Main.$wprintFloat_info:
    _c28R:
    leaq -40(%rbp),%rax
    cmpq %r15,%rax
    jb _c2aX
    movss %xmm1,%xmm0          <--- !
    mulss _n2aZ(%rip),%xmm0    <--- !
    subq $8,%rsp
    movl $1,%eax
    call rintFloat
    addq $8,%rsp
    movss %xmm1,-8(%rbp)
    movss %xmm0,%xmm1
    movq $s24l_info,-16(%rbp)
    addq $-16,%rbp
    jmp stg_decodeFloat_Int#
    _c2aX:
    movl $Main.$wprintFloat_closure,%ebx
    jmp *-8(%r13)
    .size Main.$wprintFloat_info, .-Main.$wprintFloat_info
    .section .rodata
    .align 8
    .align 4
  }}}
If I read this right, in the ```-O1``` case the ```xmm1``` register is
multiplied in place, so it contains 1000 (100 * 10), and this value is then
stored on the stack. The ```-O2``` version first moves the value from
```xmm1``` to ```xmm0``` and only then multiplies ```xmm0```; it also stores
```xmm1``` on the stack, but this time it should still be equal to 100. So if
the value stored on the stack is subsequently used as ```x```, that would
explain the difference between the two programs.
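
To make that reading concrete, here is a minimal sketch that simulates just the two marked instructions in both orders. It is only a toy model, not GHC's register allocator or native code generator: the ```Regs```, ```Instr```, ```step``` and ```run``` names are hypothetical, the constant behind ```_n2aX``` / ```_n2aZ``` is assumed to be 10 (inferred from ```x*10``` in the source), and the call to ```rintFloat``` is ignored.

```haskell
module Main where

-- Hypothetical two-register model; not GHC's real machine representation.
data Regs = Regs { xmm0 :: Float, xmm1 :: Float }
  deriving (Eq, Show)

data Reg = XMM0 | XMM1
  deriving (Eq, Show)

data Instr
  = MulConst Float Reg  -- models "mulss <const>,dst": dst := dst * const
  | Mov Reg Reg         -- models "movss src,dst":     dst := src
  deriving (Eq, Show)

get :: Reg -> Regs -> Float
get XMM0 = xmm0
get XMM1 = xmm1

set :: Reg -> Float -> Regs -> Regs
set XMM0 v r = r { xmm0 = v }
set XMM1 v r = r { xmm1 = v }

-- Execute a single instruction on the register file.
step :: Regs -> Instr -> Regs
step r (MulConst c dst) = set dst (get dst r * c) r
step r (Mov src dst)    = set dst (get src r) r

-- Execute a straight-line instruction sequence from left to right.
run :: [Instr] -> Regs -> Regs
run prog r0 = foldl step r0 prog

-- Order seen in the -O1 listing: multiply xmm1 in place, then copy it.
-- The constant 10 is an assumption (see above).
orderO1 :: [Instr]
orderO1 = [MulConst 10 XMM1, Mov XMM1 XMM0]

-- Order seen in the -O2 listing: copy first, then multiply the copy.
orderO2 :: [Instr]
orderO2 = [Mov XMM1 XMM0, MulConst 10 XMM0]

main :: IO ()
main = do
  let start = Regs { xmm0 = 0, xmm1 = 100 }  -- xmm1 holds the argument x
  print (run orderO1 start)
  print (run orderO2 start)
```

In this model both orders leave 1000.0 in ```xmm0``` (the argument passed to ```rintFloat```), but the first order leaves 1000.0 in ```xmm1``` while the second leaves 100.0, which is the value that the later ```movss %xmm1,-8(%rbp)``` would save.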

For the record, this is on x86_64 Linux, and the GHC version used is:
{{{
> ~/dev/ghc-clean/inplace/bin/ghc-stage2 --version
The Glorious Glasgow Haskell Compilation System, version 7.7.20120816
}}}

Everything works as expected on GHC-7.4.2.
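
For comparing compilers and optimisation levels from a script, a self-checking variant of the reduced program may be handy. This is only a sketch: the traces are dropped, the explicit ```Integer``` annotations spell out what type defaulting already chooses, and the ```expected``` value and the exit-code convention are my own additions. I have not verified that this variant still triggers the miscompilation, since the extra code may change what the optimiser does.

```haskell
module Main where

import Control.Monad (unless)
import System.Exit (exitFailure)

-- Same logic as the reduced example, minus the Debug.Trace calls.
printFloat :: Float -> String
printFloat x = f (show (round (x * 10) :: Integer))
  where
    f "0" = "0"
    f _   = show (round x :: Integer)

main :: IO ()
main = do
  let actual   = printFloat 100
      expected = "100"  -- the output seen with -O0, -O2 and -fnew-codegen
  -- Exit with a non-zero status when the result differs from the expected one.
  unless (actual == expected) $ do
    putStrLn ("unexpected output: " ++ actual)
    exitFailure
  putStrLn actual
```

With a correctly compiled binary this prints 100 and exits with status 0; a binary showing the behaviour described above would print the unexpected value and exit with a failure status.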

Trac metadata

Trac field	Value
Version	7.7
Type	Bug
TypeOfFailure	OtherFailure
Priority	normal
Resolution	Unresolved
Component	Compiler
Test case
Differential revisions
BlockedBy
Related
Blocking
CC
Operating system
Architecture
