-
Here we copy a subset of the UTF-8 implementation living in `ghc-boot` into `base`, with the intent of dropping the former in the future. For this reason, the `ghc-boot` copy is now CPP-guarded on `MIN_VERSION_base(4,18,0)`. Naturally, we can't copy *all* of the functions defined by `ghc-boot` as some depend upon `bytestring`; we rather just copy those which only depend upon `base` and `ghc-prim`. Further consolidation? ---------------------- Currently GHC ships with at least five UTF-8 implementations: * the implementation used by GHC in `ghc-boot:GHC.Utils.Encoding`; this can be used at a number of types including `Addr#`, `ByteArray#`, `ForeignPtr`, `Ptr`, `ShortByteString`, and `ByteString`. Most of this can be removed in GHC 9.6+2, when the copies in `base` will become available to `ghc-boot`. * the copy of the `ghc-boot` definition now exported by `base:GHC.Encoding.UTF8`. This can be used at `Addr#`, `Ptr`, `ByteArray#`, and `ForeignPtr` * the decoder used by `unpackCStringUtf8#` in `ghc-prim:GHC.CString`; this is specialised at `Addr#`. * the codec used by the IO subsystem in `base:GHC.IO.Encoding.UTF8`; this is specialised at `Addr#` but, unlike the above, supports recovery in the presence of partial codepoints (since in IO contexts codepoints may be broken across buffers) * the implementation provided by the `text` library This does seem a tad silly. On the other hand, these implementations *do* materially differ from one another (e.g. in the types they support, the detail in errors they can report, and the ability to recover from partial codepoints). Consequently, it's quite unclear that further consolidate would be worthwhile.
e8ac91db