Partial fix for #2917
- add newAlignedPinnedByteArray# for allocating pinned BAs with arbitrary alignment - the old newPinnedByteArray# now aligns to 16 bytes Foreign.alloca will use newAlignedPinnedByteArray#, and so might end up wasting less space than before (we used to align to 8 by default). Foreign.allocaBytes and Foreign.mallocForeignPtrBytes will get 16-byte aligned memory, which is enough to avoid problems with SSE instructions on x86, for example. There was a bug in the old newPinnedByteArray#: it aligned to 8 bytes, but would have failed if the header was not a multiple of 8 (fortunately it always was, even with profiling). Also we occasionally wasted some space unnecessarily due to alignment in allocatePinned(). I haven't done anything about Foreign.malloc/mallocBytes, which will give you the same alignment guarantees as malloc() (8 bytes on Linux/x86 here).
Showing with 63 additions and 25 deletions