Optimize `putSLEB128` to use fewer memory writes.
Currently we compute one byte at a time and write each byte out as we go, which adds a good deal of overhead.
However, we could instead:
- Aggregate up to 8 bytes inside a Word64 at a time
- Write those out using one to three writes. (If we need to write out 7 bytes we can write 4, 2 and then 1 byte.)
- Aggregate the next up to 8 bytes and repeat.
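
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual implementation: the names `packSLEB128`, `writeWidths` and `packedBytes` are hypothetical, the input is assumed to be an `Int64` whose encoding fits into 8 bytes, and the standard SLEB128 termination rule is assumed.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Int (Int64)
import Data.Word (Word64, Word8)

-- | Pack the SLEB128 bytes of a value into a single Word64, first output
-- byte in the least significant position. Returns the packed word and the
-- number of bytes used. Assumption: the encoding needs at most 8 bytes;
-- larger values would lose bytes and need a separate code path.
packSLEB128 :: Int64 -> (Word64, Int)
packSLEB128 = go 0 0
  where
    go acc n v
      | done      = (acc .|. (byte `shiftL` (8 * n)), n + 1)
      | otherwise = go (acc .|. ((byte .|. 0x80) `shiftL` (8 * n))) (n + 1) rest
      where
        byte       = fromIntegral (v .&. 0x7f) :: Word64
        rest       = v `shiftR` 7          -- arithmetic shift keeps the sign
        signBitSet = byte .&. 0x40 /= 0
        -- Stop once the remaining bits are pure sign extension.
        done = (rest == 0 && not signBitSet) || (rest == -1 && signBitSet)

-- | Split a byte count (1..8) into store widths, widest first, so that at
-- most three stores are needed (for example 7 bytes become 4, 2 and 1).
writeWidths :: Int -> [Int]
writeWidths n = [w | w <- [8, 4, 2, 1], n .&. w /= 0]

-- | Unpack the aggregated word into its bytes, for inspection only.
packedBytes :: (Word64, Int) -> [Word8]
packedBytes (w, n) = [fromIntegral (w `shiftR` (8 * i)) | i <- [0 .. n - 1]]
```

A real implementation would presumably issue one store per entry of `writeWidths` instead of materialising a byte list; `packedBytes` is there only to make the result easy to check.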
We know every value less than or equal to 72057594037927935 :: Word64 (that is, 2^56 - 1, since each encoded byte carries 7 payload bits) fits into 8 bytes. So we could also split the code into a small-integer and a large-integer code path. The small path would aggregate the result into a Word64 before writing it; the large path would keep doing what we do now, since large values are rare and their performance isn't critical.
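
A minimal sketch of how the dispatch between the two paths might look. All names here (`smallLimit`, `fitsSmallUnsigned`, `fitsSmallSigned`, `putLEB128With`) are hypothetical, and the two writers are passed in as parameters rather than tied to any existing function. Note that the bound quoted above is the unsigned one; for a signed encoding, assuming the standard SLEB128 termination rule, the corresponding 8-byte range would be -2^55 to 2^55 - 1.

```haskell
import Data.Int (Int64)
import Data.Word (Word64)

-- | Largest unsigned value whose LEB128 encoding fits in 8 bytes (2^56 - 1).
smallLimit :: Word64
smallLimit = 72057594037927935

-- | Unsigned check: 8 bytes carry 8 * 7 = 56 payload bits.
fitsSmallUnsigned :: Word64 -> Bool
fitsSmallUnsigned w = w <= smallLimit

-- | Signed check (assumption: standard SLEB128, where one of the 56 bits
-- acts as the sign bit).
fitsSmallSigned :: Int64 -> Bool
fitsSmallSigned v = v >= negate bound && v < bound
  where
    bound = 2 ^ (55 :: Int)

-- | Choose between a fast path that aggregates into a Word64 and the
-- existing byte-at-a-time path. Both writers are supplied by the caller.
putLEB128With :: (Int64 -> r) -> (Int64 -> r) -> Int64 -> r
putLEB128With small large v
  | fitsSmallSigned v = small v
  | otherwise         = large v
```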
Do note, however, that putWord64 and friends currently also write one byte at a time. They should be updated to perform a single write when the architecture allows for it, given its alignment restrictions.
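
For illustration, a single-store little-endian write could look roughly like the sketch below. The names `pokeWord64le` and `word64leBytes` are hypothetical and not part of any existing API. The sketch assumes the destination pointer is suitably aligned or that the target tolerates unaligned stores; it does not attempt to detect architectures where that is not the case.

```haskell
import Data.Word (Word64, Word8, byteSwap64)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Marshal.Array (peekArray)
import Foreign.Ptr (Ptr, castPtr)
import Foreign.Storable (poke)
import GHC.ByteOrder (ByteOrder (..), targetByteOrder)

-- | Store a Word64 in little-endian byte order with one 8-byte write.
-- Assumption: `p` is 8-byte aligned or unaligned stores are permitted.
pokeWord64le :: Ptr Word8 -> Word64 -> IO ()
pokeWord64le p w = poke (castPtr p :: Ptr Word64) hostOrdered
  where
    -- `poke` writes in host byte order, so swap on big-endian targets.
    hostOrdered = case targetByteOrder of
      LittleEndian -> w
      BigEndian    -> byteSwap64 w

-- | Demonstration helper: write into a scratch buffer and read the bytes back.
word64leBytes :: Word64 -> IO [Word8]
word64leBytes w = allocaBytes 8 $ \p -> do
  pokeWord64le p w
  peekArray 8 p
```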