Add primops for MMX parallel subtract and parallel add intrinsics
Parallel subtract: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=MMX&text=_m_psub&expand=5805
Parallel add: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=MMX&text=_m_padd&expand=5805
These perform byte-wise, short-wise and word-wise subtraction and addition within a word64. There is also a saturating version of each, which clamps the result at the maximum or minimum value instead of letting it overflow or underflow.
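To make the intended lane semantics concrete, here is a minimal reference sketch for the byte-wise case only. The names `toLanes`, `fromLanes`, `refAddWrap` and `refAddSatU` are hypothetical and are not proposed primop names. The saturating variant shown is the unsigned one, and the lane order (least significant byte first) is an assumption made for illustration.

```haskell
import Data.Bits (shiftL, shiftR, (.|.))
import Data.Word (Word64, Word8)

-- Split a Word64 into its eight byte lanes, least significant first.
toLanes :: Word64 -> [Word8]
toLanes w = [fromIntegral (w `shiftR` (8 * i)) | i <- [0 .. 7]]

-- Reassemble byte lanes (least significant first) into a Word64.
fromLanes :: [Word8] -> Word64
fromLanes = foldr (\b acc -> (acc `shiftL` 8) .|. fromIntegral b) 0

-- Wrapping byte-wise add: each lane wraps modulo 256 on its own,
-- and no carry crosses into the neighbouring lane.
refAddWrap :: Word64 -> Word64 -> Word64
refAddWrap x y = fromLanes (zipWith (+) (toLanes x) (toLanes y))

-- Unsigned saturating byte-wise add: a lane that would overflow
-- is clamped to 0xFF instead of wrapping.
refAddSatU :: Word64 -> Word64 -> Word64
refAddSatU x y = fromLanes (zipWith sat (toLanes x) (toLanes y))
  where
    sat a b = let s = a + b in if s < a then maxBound else s
```

For example, `refAddWrap 0x00FF 0x0001` gives `0x0000`, whereas `refAddSatU 0x00FF 0x0001` gives `0x00FF` and an ordinary `Word64` addition gives `0x0100`.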
These instructions are significantly faster than broadword programming, which in turn is faster than word-at-a-time processing.
On systems that don't have these instructions the behaviour can be emulated with broadword programming.
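As a rough illustration of such a fallback, the sketch below emulates the wrapping byte-wise add and subtract using the well-known broadword trick of masking off the top bit of every lane so that carries and borrows cannot cross lane boundaries. The names `h8`, `paddbEmu` and `psubbEmu` are hypothetical. The saturating variants and the wider lane sizes are not covered here.

```haskell
import Data.Bits (complement, xor, (.&.), (.|.))
import Data.Word (Word64)

-- Mask selecting the high bit of every 8-bit lane.
h8 :: Word64
h8 = 0x8080808080808080

-- Wrapping byte-wise add of eight lanes packed in a Word64.
-- The low 7 bits of each lane are added with the high bits cleared,
-- so a carry stops at bit 7 of its own lane; the true high bit of
-- each lane is then restored with xor.
paddbEmu :: Word64 -> Word64 -> Word64
paddbEmu x y =
  ((x .&. complement h8) + (y .&. complement h8))
    `xor` ((x `xor` y) .&. h8)

-- Wrapping byte-wise subtract of eight lanes packed in a Word64.
-- Setting the high bit of each lane of x guarantees that a borrow
-- is absorbed inside the lane; the high bit is then corrected.
psubbEmu :: Word64 -> Word64 -> Word64
psubbEmu x y =
  ((x .|. h8) - (y .&. complement h8))
    `xor` ((x `xor` complement y) .&. h8)
```

For example, `paddbEmu 0x00FF 0x0001` gives `0x0000` and `psubbEmu 0x0100 0x0001` gives `0x01FF`, because neither the carry nor the borrow leaks into the adjacent lane.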
Details to follow.