base: Introduce GHC.Encoding.UTF8 · e8ac91db

Ben Gamari authored Jul 16, 2022 and Marge Bot committed Jul 22, 2022

Here we copy a subset of the UTF-8 implementation living in `ghc-boot`
into `base`, with the intent of dropping the former in the future. For
this reason, the `ghc-boot` copy is now CPP-guarded on
`MIN_VERSION_base(4,18,0)`.
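
As a rough illustration of this kind of guard (a sketch only, not the
actual `ghc-boot` source), a module can take a function from `base`
when it is new enough and otherwise fall back to a local definition.
The use of `utf8EncodedLength` and the body of the fallback are
assumptions made for the example; `MIN_VERSION_base` is the macro
normally supplied by Cabal.

```haskell
{-# LANGUAGE CPP #-}
module Main (main) where

#if MIN_VERSION_base(4,18,0)
-- New enough base: use the copy that now lives there.
import GHC.Encoding.UTF8 (utf8EncodedLength)
#else
import Data.Char (ord)

-- Illustrative local fallback for older base versions.
-- Assumption: NUL is counted as two bytes, as in Modified UTF-8.
utf8EncodedLength :: String -> Int
utf8EncodedLength = sum . map width
  where
    width c
      | n > 0 && n <= 0x7f = 1
      | n <= 0x7ff         = 2
      | n <= 0xffff        = 3
      | otherwise          = 4
      where n = ord c
#endif

main :: IO ()
main = mapM_ (print . utf8EncodedLength) ["abc", "\233", "\8364"]
```

Either branch gives callers the same name and type, so the local copy
can be deleted once older `base` versions no longer need supporting.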

Naturally, we can't copy *all* of the functions defined by `ghc-boot` as
some depend upon `bytestring`; we rather just copy those which only
depend upon `base` and `ghc-prim`.

Further consolidation?
----------------------

Currently GHC ships with at least five UTF-8 implementations:

* the implementation used by GHC in `ghc-boot:GHC.Utils.Encoding`; this
  can be used at a number of types including `Addr#`, `ByteArray#`,
  `ForeignPtr`, `Ptr`, `ShortByteString`, and `ByteString`. Most of this
  can be removed in GHC 9.6+2, when the copies in `base` will become
  available to `ghc-boot`.
* the copy of the `ghc-boot` definition now exported by
  `base:GHC.Encoding.UTF8`. This can be used at `Addr#`, `Ptr`,
  `ByteArray#`, and `ForeignPtr`.
* the decoder used by `unpackCStringUtf8#` in `ghc-prim:GHC.CString`;
  this is specialised at `Addr#`.
* the codec used by the IO subsystem in `base:GHC.IO.Encoding.UTF8`;
  this is specialised at `Addr#` but, unlike the above, supports
  recovery in the presence of partial codepoints (since in IO contexts
  codepoints may be broken across buffers).
* the implementation provided by the `text` library.

This does seem a tad silly. On the other hand, these implementations
*do* materially differ from one another (e.g. in the types they support,
the detail in errors they can report, and the ability to recover from
partial codepoints). Consequently, it's quite unclear that further
consolidation would be worthwhile.
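
To make the partial-codepoint difference concrete, here is a minimal
sketch of a chunked decoder that carries an incomplete trailing
sequence over to the next buffer. It is not the code in
`base:GHC.IO.Encoding.UTF8`; all names (`decodeChunk`, `decodeStream`,
and so on) are hypothetical, it works on lists of bytes for brevity,
and it does not reject overlong encodings.

```haskell
module Main (main) where

import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Length of the sequence introduced by a lead byte, if the byte is a valid lead.
seqLen :: Word8 -> Maybe Int
seqLen b
  | b < 0x80           = Just 1
  | b .&. 0xE0 == 0xC0 = Just 2
  | b .&. 0xF0 == 0xE0 = Just 3
  | b .&. 0xF8 == 0xF0 = Just 4
  | otherwise          = Nothing

isCont :: Word8 -> Bool
isCont b = b .&. 0xC0 == 0x80

-- Combine a lead byte and its continuation bytes into a character.
decodeSeq :: [Word8] -> Char
decodeSeq []     = '\xFFFD'
decodeSeq (b:cs)
  | code > 0x10FFFF = '\xFFFD'   -- outside the Unicode range
  | otherwise       = chr code
  where
    code = foldl step lead cs
    lead = case length cs of
      0 -> fromIntegral b
      1 -> fromIntegral (b .&. 0x1F)
      2 -> fromIntegral (b .&. 0x0F)
      _ -> fromIntegral (b .&. 0x07)
    step acc c = (acc `shiftL` 6) .|. fromIntegral (c .&. 0x3F)

-- Decode one buffer, given the bytes left over from the previous one.
-- Returns the decoded text and the incomplete tail to carry forward.
decodeChunk :: [Word8] -> [Word8] -> (String, [Word8])
decodeChunk leftover chunk = go (leftover ++ chunk)
  where
    go [] = ([], [])
    go bs@(b:rest) = case seqLen b of
      Nothing -> emit '\xFFFD' rest
      Just n
        | length (take n bs) < n ->
            -- Truncated sequence: keep it for the next buffer if it is
            -- still plausible, otherwise substitute U+FFFD.
            if all isCont rest then ([], bs) else emit '\xFFFD' rest
        | all isCont (take (n - 1) rest) ->
            emit (decodeSeq (take n bs)) (drop n bs)
        | otherwise -> emit '\xFFFD' rest
    emit c more = let (cs, l) = go more in (c : cs, l)

-- Decode a sequence of buffers; a dangling tail at end of input is an error.
decodeStream :: [[Word8]] -> String
decodeStream = go []
  where
    go leftover []     = ['\xFFFD' | not (null leftover)]
    go leftover (c:cs) = let (s, l) = decodeChunk leftover c in s ++ go l cs

main :: IO ()
main = do
  -- U+20AC is E2 82 AC; here it is split across two buffers.
  print (decodeChunk [] [0x68, 0xE2, 0x82])
  print (decodeStream [[0x68, 0xE2, 0x82], [0xAC, 0x21]])
```

A decoder working on a complete `Addr#` or `ByteArray#` has no need
for the carried-over state, which is one reason the implementations
listed above do not trivially collapse into one.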
