Add primops for MMX parallel subtract and parallel add intrinsics
These perform byte-wise, short-wise, and word-wise subtraction and addition on the lanes packed into a Word64. There is also a saturating version of each, which clamps the result to the maximum or minimum value instead of wrapping around on overflow/underflow.
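To make the intended semantics concrete, here is a minimal reference sketch of the byte-wise case in plain Haskell. It is not the proposed primop API: the names `toLanes`, `fromLanes`, `addWrap8` and `addSatU8` are hypothetical, and only unsigned saturation is shown. Each Word64 is treated as eight independent byte lanes.

```haskell
import Data.Bits (shiftL, shiftR, (.|.))
import Data.Word (Word64, Word8)

-- Split a Word64 into its eight byte lanes, least significant first.
toLanes :: Word64 -> [Word8]
toLanes w = [fromIntegral (w `shiftR` (8 * i)) | i <- [0 .. 7]]

-- Reassemble byte lanes (least significant first) into a Word64.
fromLanes :: [Word8] -> Word64
fromLanes = foldr (\b acc -> (acc `shiftL` 8) .|. fromIntegral b) 0

-- Wrapping add: each lane overflows independently, modulo 256.
addWrap8 :: Word64 -> Word64 -> Word64
addWrap8 x y = fromLanes (zipWith (+) (toLanes x) (toLanes y))

-- Unsigned saturating add: a lane that would overflow is clamped to 0xFF.
addSatU8 :: Word64 -> Word64 -> Word64
addSatU8 x y = fromLanes (zipWith satAdd (toLanes x) (toLanes y))
  where
    satAdd a b = let s = a + b in if s < a then maxBound else s
```

For example, adding 0x01FF and 0x0101 lane by lane gives 0x0200 with wrapping (the low byte wraps to 0x00 and does not carry into its neighbour) and 0x02FF with unsigned saturation.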
These instructions are significantly faster than broadword programming, which in turn is faster than word-at-a-time processing.
On systems that don't have these instructions, the behaviour can be emulated with broadword programming.
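As a rough illustration of what such a fallback could look like, the sketch below emulates wrapping byte-wise add and subtract on a Word64 using the standard broadword (SWAR) trick of handling the top bit of each byte separately, so carries and borrows never cross lane boundaries. The function names `addBytes` and `subBytes` are hypothetical and are not taken from any existing GHC API.

```haskell
import Data.Bits (complement, xor, (.&.), (.|.))
import Data.Word (Word64)

-- Mask selecting the high bit of every byte lane.
highBits :: Word64
highBits = 0x8080808080808080

-- Mask selecting the low seven bits of every byte lane.
lowBits :: Word64
lowBits = 0x7F7F7F7F7F7F7F7F

-- Wrapping byte-wise add. The low seven bits of each lane are added
-- normally (the carry lands in bit 7 and cannot escape the lane), then
-- bit 7 is fixed up with xor.
addBytes :: Word64 -> Word64 -> Word64
addBytes x y = ((x .&. lowBits) + (y .&. lowBits)) `xor` ((x `xor` y) .&. highBits)

-- Wrapping byte-wise subtract. Forcing bit 7 of each lane of x to 1
-- guarantees no borrow leaves the lane; bit 7 is then corrected with xor.
subBytes :: Word64 -> Word64 -> Word64
subBytes x y =
  ((x .|. highBits) - (y .&. lowBits)) `xor` ((x `xor` complement y) .&. highBits)
```

With these definitions, `addBytes 0x01FF 0x0101` yields `0x0200` and `subBytes 0x0200 0x0101` yields `0x01FF`. The same idea extends to 16-bit and 32-bit lanes by changing the two masks.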
Details to follow.