pmonday · 5dfaa332
--- a/simd-plan.md
+++ b/simd-plan.md
@@ -6,6 +6,7 @@ This text documents the implementation stages / components for adding SIMD suppo

 Based on that design, the high-level tasks that must be accomplished include the following:

+1. Modify autoconf to determine SSE availability and vector size
 1. Add new PrimOps to allow Haskell to make use of Vectors
 1. Add new MachOps to Cmm to communicate use of Vectors
 1. Modify the LLVM Code Generator to translate Cmm to LLVM vector instructions
@@ -17,8 +18,8 @@ Based on that design, the high-level tasks that must be accomplished include the

 Introduction of SIMD support to GHC will occur in stages to demonstrate the entire “vertical” stack is functional:

-1. Introduce “Double” PrimOps (as necessary to run an example showing SIMD usage in the LLVM)
-1. Add appropriate Cmm support for the Double primtype / primop subset
+1. Introduce “Float” PrimOps (as necessary to run an example showing SIMD usage in the LLVM)
+1. Add appropriate Cmm support for the Float primtype / primop subset
 1. Modify the LLVM Code Generator to support the Double vectorization
 1. Demonstrate the PrimOps and do limited performance testing to ensure SIMD is functional
 1. Modify Vector Libraries to make use of new PrimOps
@@ -33,6 +34,106 @@ Introduction of SIMD support to GHC will occur in stages to demonstrate the enti

 These clearly won't be all of the questions I have, there is a substantial amount of work that goes through the entire GHC compiler stack before reaching the LLVM instructions.

+## Modify autoconf
+
+
+Whether a particular hardware architecture has SIMD instructions, which version of those instructions is available (SSE, SSE2, SSE3, SSE4 or an iteration of one of those), and consequently the size of the vectors that are supported are all determined during the configuration step, on the architecture where the build will occur.  This is the same step in which the sizes of Ints, the alignment constants, and other values critical to GHC are calculated.
+
+
+Working backwards from the results to where the changes are introduced: the current alignment and primitive sizes are available in ./includes/ghcautoconf.h.  Here is a sample:
+
+```wiki
+...
+/* The size of `char', as computed by sizeof. */
+#define SIZEOF_CHAR 1
+
+/* The size of `double', as computed by sizeof. */
+#define SIZEOF_DOUBLE 8
+...
+```
+
+
+These are constructed from the mk/config.h\* files that are generated by configure.ac and autoheader.  It should be possible to modify configure.ac (or a related file) to determine whether SSE is available and, consequently, the vector size that can be operated on and (later) how many pieces of data can be operated on in parallel (determined by the operation).  MMX used 64-bit registers, while SSE and all later versions use 128-bit XMM registers (the original SSE only operated on single-precision floats in them; SSE2 added double and integer operations).  This implies that any 32-bit type can have 4 pieces of data processed by a single instruction.
+
+
+An example of configure.ac modifications that detect SSE availability can be found on the web; the primary body of the check is as follows (xmmintrin.h declares the SSE intrinsics):
+
+```wiki
+AC_MSG_CHECKING(for SSE in current arch/CFLAGS)
+AC_LINK_IFELSE([
+AC_LANG_PROGRAM([[
+#include <xmmintrin.h>
+__m128 testfunc(float *a, float *b) {
+  return _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b));
+}
+]])],
+[
+has_sse=yes
+],
+[
+has_sse=no
+]
+)
+AC_MSG_RESULT($has_sse)
+
+if test "$has_sse" = yes; then
+  AC_DEFINE([_USE_SSE], 1, [Enable SSE support])
+  AC_DEFINE([_VECTOR_SIZE], 128, [Default Vector Size for Optimization])
+fi
+```
+
+
+Once mk/config.h contains the above definitions, includes/ghcautoconf.h is regenerated during the first stage of the GHC build process, after which the _USE_SSE constant is available to the Cmm definitions (next section).
+
+
+More detailed explanations of how to use cpuid to determine the supported SSE instruction set are available on the web as well.  Using cpuid may be more appropriate but is also much more complex.  Details for using cpuid are available at [ http://software.intel.com/en-us/articles/using-cpuid-to-detect-the-presence-of-sse-41-and-sse-42-instruction-sets/](http://software.intel.com/en-us/articles/using-cpuid-to-detect-the-presence-of-sse-41-and-sse-42-instruction-sets/).
+
+
+Note that since the overall goal is to let LLVM generate the actual assembly code that does the vectorization, vectorization can only be supported up to the SSE version that LLVM itself supports.
+
+## Add new MachOps to Cmm code
+
+
+It may make more sense to add the MachOps to Cmm prior to implementing the PrimOps (or at least before adding the code to the CgPrimOp.hs file).  There is a useful [ Cmm Wiki Page](http://hackage.haskell.org/trac/ghc/wiki/Commentary/Compiler/CmmType#AdditionsinCmm) available to aid in the definition of the new Cmm operations.
+
+
+Modify compiler/cmm/CmmType.hs to add the required new vector types.  Here is a basic outline of what needs to be done:
+
+```wiki
+data CmmType    -- The important one!
+  = CmmType CmmCat Width
+ 
+type Multiplicity = Int
+ 
+data CmmCat     -- "Category" (not exported)
+   = GcPtrCat   -- GC pointer
+   | BitsCat   -- Non-pointer
+   | FloatCat   -- Float
+   | VBitsCat  Multiplicity  -- Vector of non-pointers
+   | VFloatCat Multiplicity  -- Vector of floats
+   deriving( Eq )
+        -- See Note [Signed vs unsigned] at the end
+```
+
+
+Modify compiler/cmm/CmmMachOp.hs to add the MachOps that the PrimOps modifications need in order to support SIMD.  Here is an example that adds SIMD integer arithmetic MachOps and a SIMD version of the MO_F_Add MachOp:
+
+```wiki
+  -- Integer SIMD arithmetic
+  | MO_V_Add  Width Int
+  | MO_V_Sub  Width Int
+  | MO_V_Neg  Width Int         -- unary -
+  | MO_V_Mul  Width Int
+  | MO_V_Quot Width Int
+
+  -- Floating point arithmetic
+  | MO_VF_Add Width Int   -- MO_VF_Add W64 4   Add 4-vector of 64-bit floats
+  ...
+```
+
+
+Some existing Cmm instructions can probably be reused, but additional instructions will have to be added for the vectorization primitives.  Adding them separately keeps the SIMD and non-SIMD code generation apart for the time being, until the SIMD path is working.
+
 ## Add new PrimOps


@@ -121,61 +222,34 @@ The above, after compilation, adds the following to the ./compiler/prelude/PrimO
   | FloatVectorAddOp
 ```

-## Add new MachOps to Cmm code
-
-
-It may make more sense to add the MachOps to Cmm prior to implementing the PrimOps (or at least before adding the code to the CgPrimOp.hs file).  There is a useful [ Cmm Wiki Page](http://hackage.haskell.org/trac/ghc/wiki/Commentary/Compiler/CmmType#AdditionsinCmm) available to aid in the definition of the new Cmm operations.
-
+## Modify LLVM Code Generator

-Modify compiler/cmm/CmmType.hs to add new required vector types and such, here is a basic outline of what needs to be done:
-
-```wiki
-data CmmType    -- The important one!
-  = CmmType CmmCat Width
- 
-type Multiplicty = Int
- 
-data CmmCat     -- "Category" (not exported)
-   = GcPtrCat   -- GC pointer
-   | BitsCat   -- Non-pointer
-   | FloatCat   -- Float
-   | VBitsCat  Multiplicity   -- Non-pointer
-   | VFloatCat Multiplicity  -- Float
-   deriving( Eq )
-        -- See Note [Signed vs unsigned] at the end
-```

+Translate the MachOps in the Cmm definition to the corresponding LLVM instructions.  LLVM code generation lives in the /compiler/llvmGen directory.  The following will have to be modified (at a minimum):

-Modify compiler/cmm/CmmMachOp.hs, this will add the necessary MachOps for use from the PrimOps modifications to support SIMD.  Here is an example of adding a SIMD version of the MO_F_Add MachOp:
+- /compiler/llvmGen/Llvm/Types.hs - add the MachOps from Cmm and how they bridge to the LLVM vector operations
+- /compiler/llvmGen/LlvmCodeGen/CodeGen.hs - This is the heart of the translation from MachOps to LLVM code.  Significant changes may be needed here.

-```wiki
-  -- Integer SIMD arithmetic
-  | MO_V_Add  Width Int
-  | MO_V_Sub  Width Int
-  | MO_V_Neg  Width Int         -- unary -
-  | MO_V_Mul  Width Int
-  | MO_V_Quot Width Int
+- Remaining /compiler/llvmGen/\* - Supporting changes

-  -- Floating point arithmetic
-  MO_VF_Add Width Int   -- MO_VF_Add W64 4   Add 4-vector of 64-bit floats
-  ...
-```

+Once the LLVM Code Generator is modified to support Double instructions, tests can be run to ensure the “bottom half” of the stack works.

-Some existing Cmm instructions may be able to be reused, but there will have to be additional instructions added to account for vectorization primitives.  This will help keep the SIMD / non-SIMD code generation separate for the time being until we have it working.
+## Modify Native Code Generator

-## Modify LLVM Code Generator

+Unfortunately, the native code generator will also have to be modified.  Building GHC depends on a 6.x version of GHC, which predates the inclusion of LLVM code generation in HEAD, so simply modifying the mk/build.mk file to use -fllvm does not work.

-Take the MachOps in the Cmm definition and translate correctly to the corresponding LLVM instructions.  LLVM code generation is in the /compiler/llvmGen directory.  The following will have to be modified (at a minimum):

- /compiler/llvmGen/Llvm/Types.hs - add the MachOps from Cmm and how they bridge to the LLVM vector operations
- /compiler/llvmGen/LlvmCodeGen/CodeGen.hs - This is the heart of the translation from MachOps to LLVM code.   Possibly significant changes will have to be added.
+For x86 Native Code Generation, locate the ./compiler/nativeGen/X86/CodeGen.hs file and modify it appropriately.  For the example above, simply adding a conversion from MO_VF_Add to the equivalent non-vector add is sufficient.

- Remaining /compiler/llvmGen/\* - Supporting changes
+```wiki
+      MO_VF_Add w i | sse2      -> trivialFCode_sse2 w ADD  x y
+                    | otherwise -> trivialFCode_x87    GADD x y   
+```


-Once the LLVM Code Generator is modified to support Double instructions, tests can be run to ensure the “bottom half” of the stack works.
+Changes for the remaining new MachOps may be much larger.

 ## Example: Demonstrate SIMD Operation

@@ -261,6 +335,11 @@ To make this work, a ghc compiler flag must be added that forces all vector leng
 - Add the ghc compiler option: --vector-length=1
 - Modify compiler/vectorise to recognize the new option or add this compiler pass as appropriate

+## Reference Documentation
+
+- [ LLVM Vectorization](https://wiki.aalto.fi/display/t1065450/LLVM+vectorization)
+- [ LLVM Vector Type](http://llvm.org/docs/LangRef.html#t_vector)
+
 ## Reference Discussion Threads

 ```wiki