  • Ben Gamari's avatar
    base: Introduce GHC.Encoding.UTF8 · e8ac91db
    Ben Gamari authored and Marge Bot committed
    Here we copy a subset of the UTF-8 implementation living in `ghc-boot`
    into `base`, with the intent of dropping the former in the future. For
    this reason, the `ghc-boot` copy is now CPP-guarded on
    `MIN_VERSION_base(4,18,0)`.
    
    Naturally, we can't copy *all* of the functions defined by `ghc-boot` as
    some depend upon `bytestring`; we rather just copy those which only
    depend upon `base` and `ghc-prim`.
    
    Further consolidation?
    ----------------------
    
    Currently GHC ships with at least five UTF-8 implementations:
    
    * the implementation used by GHC in `ghc-boot:GHC.Utils.Encoding`; this
      can be used at a number of types including `Addr#`, `ByteArray#`,
      `ForeignPtr`, `Ptr`, `ShortByteString`, and `ByteString`. Most of this
      can be removed in GHC 9.6+2, when the copies in `base` will become
      available to `ghc-boot`.
    * the copy of the `ghc-boot` definition now exported by
      `base:GHC.Encoding.UTF8`. This can be used at `Addr#`, `Ptr`,
      `ByteArray#`, and `ForeignPtr`.
    * the decoder used by `unpackCStringUtf8#` in `ghc-prim:GHC.CString`;
      this is specialised at `Addr#`.
    * the codec used by the IO subsystem in `base:GHC.IO.Encoding.UTF8`;
      this is specialised at `Addr#` but, unlike the above, supports
      recovery in the presence of partial codepoints (since in IO contexts
      codepoints may be broken across buffers).
    * the implementation provided by the `text` library.
    
    This does seem a tad silly. On the other hand, these implementations
    *do* materially differ from one another (e.g. in the types they support,
    the detail in errors they can report, and the ability to recover from
    partial codepoints). Consequently, it's quite unclear that further
    consolidation would be worthwhile.
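    
    To make the partial-codepoint distinction above concrete, here is a
    minimal sketch of a decoder that reports "input ended mid-sequence"
    separately from "input is malformed". It is an illustration only, not
    the code in `base` or `ghc-boot`: the names `Step`, `decodeStep`, and
    `decodeChunk` are hypothetical, it works on `[Word8]` rather than
    `Addr#`, and it does not reject overlong encodings or surrogates.
    
    ```haskell
    import Data.Bits (shiftL, (.&.), (.|.))
    import Data.Char (chr)
    import Data.Word (Word8)
    
    -- | Outcome of decoding one code point from the front of a byte list.
    -- (Hypothetical type, for illustration only.)
    data Step
      = Decoded Char [Word8] -- ^ one code point plus the remaining bytes
      | Incomplete [Word8]   -- ^ input ended mid-sequence; bytes to carry over
      | Invalid              -- ^ bad leading or continuation byte
      | Done                 -- ^ no input left
      deriving (Eq, Show)
    
    decodeStep :: [Word8] -> Step
    decodeStep [] = Done
    decodeStep bs@(b : rest)
      | b < 0x80 = Decoded (chr (fromIntegral b)) rest
      | b .&. 0xE0 == 0xC0 = go 1 (fromIntegral (b .&. 0x1F)) rest
      | b .&. 0xF0 == 0xE0 = go 2 (fromIntegral (b .&. 0x0F)) rest
      | b .&. 0xF8 == 0xF0 = go 3 (fromIntegral (b .&. 0x07)) rest
      | otherwise = Invalid
      where
        -- Consume the expected number of continuation bytes.
        go :: Int -> Int -> [Word8] -> Step
        go 0 acc xs
          | acc > 0x10FFFF = Invalid
          | otherwise = Decoded (chr acc) xs
        go _ _ [] = Incomplete bs -- ran out of bytes: not an error yet
        go k acc (x : xs)
          | x .&. 0xC0 == 0x80 =
              go (k - 1) ((acc `shiftL` 6) .|. fromIntegral (x .&. 0x3F)) xs
          | otherwise = Invalid
    
    -- | Decode as much of a chunk as possible, returning the decoded text
    -- and any trailing bytes that belong to a not-yet-complete code point.
    decodeChunk :: [Word8] -> Either String (String, [Word8])
    decodeChunk bs = case decodeStep bs of
      Done -> Right ([], [])
      Incomplete pending -> Right ([], pending)
      Invalid -> Left "invalid UTF-8"
      Decoded c rest ->
        fmap (\(cs, pending) -> (c : cs, pending)) (decodeChunk rest)
    ```
    
    A caller reading from buffers would prepend the returned pending bytes
    to the next chunk before decoding it. A decoder that only ever sees
    complete input can skip this bookkeeping and treat a truncated
    sequence as an error, which is one way such implementations can
    reasonably differ.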