Implement SSE2 floating-point support in the x86 native code generator (#594)
The new flag -msse2 enables code generation for SSE2 on x86. It results in substantially faster floating-point performance; the main reason for doing this was that our x87 code generation is appallingly bad, and since we plan to drop -fvia-C soon, we need a way to generate half-decent floating-point code. The catch is that SSE2 is only available on CPUs that support it (P4+, AMD K8+). We'll have to think hard about whether we should enable it by default for the libraries we ship. In the meantime, at least -msse2 should be an acceptable replacement for "-fvia-C -optc-ffast-math -fexcess-precision". SSE2 also has the advantage of performing all operations at the correct precision, so floating-point results are consistent with other platforms. I also tweaked the x87 code generation a bit while I was here, now it's slighlty less bad than before.