
ghc-boot: Move simple UTF-8 codecs into base

Ben Gamari requested to merge wip/base-utf8-codecs into master

ghc-boot defines a set of simple UTF-8 codecs which are used by ghc in its lexer, FastString module, and elsewhere. However, in !2816 (merged) we needed a similar set of codecs in base for encoding and decoding small strings into and out of ByteArray#s. This branch begins the process of migrating these codecs from ghc-boot into base. We do this as follows:

  • Move the ghc-boot UTF-8 logic from GHC.Utils.Encoding into GHC.Utils.Encoding.UTF8
  • Rename the functions to improve naming consistency
  • Copy the subset of functions which depend only on base and ghc-prim exports into a new base module, GHC.Encoding.UTF8
  • CPP-guard the ghc-boot definitions on MIN_VERSION_base(4,18,0), allowing us to remove them in two GHC releases
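The CPP guard in the last step might look roughly like the sketch below. It is an illustration, not the actual ghc-boot source: it uses a single function, and it assumes that `utf8EncodedLength` is among the names exported by the new base module. The fallback's treatment of NUL as two bytes is likewise an assumption about the ghc-boot behaviour.

```haskell
{-# LANGUAGE CPP #-}
-- Hypothetical sketch of guarding a local definition on the base version.

#if MIN_VERSION_base(4,18,0)
-- New enough base: reuse the copy that now lives in base.
-- (Assumption: base exports utf8EncodedLength from GHC.Encoding.UTF8.)
import GHC.Encoding.UTF8 (utf8EncodedLength)
#else
import Data.Char (ord)

-- Older base: keep a local definition until the minimum supported base
-- version catches up, at which point this branch can be deleted.
-- Assumption: NUL is counted as two bytes (modified UTF-8).
utf8EncodedLength :: String -> Int
utf8EncodedLength = sum . map width
  where
    width c
      | n > 0 && n <= 0x7f = 1
      | n <= 0x7ff         = 2
      | n <= 0xffff        = 3
      | otherwise          = 4
      where n = ord c
#endif
```

Either branch gives callers the same name, so the guard can later be removed without touching call sites.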

Further consolidation?

Currently GHC ships with at least five UTF-8 implementations:

  • the implementation used by GHC in ghc-boot:GHC.Utils.Encoding; this can be used at a number of types including Addr#, ByteArray#, ForeignPtr, Ptr, ShortByteString, and ByteString. Most of this can be removed in GHC 9.6+2, when the copies in base will become available to ghc-boot.
  • the copy of the ghc-boot definition now exported by base:GHC.Encoding.UTF8. This can be used at Addr#, Ptr, ByteArray#, and ForeignPtr
  • the decoder used by unpackCStringUtf8# in ghc-prim:GHC.CString; this is specialised at Addr#.
  • the codec used by the IO subsystem in base:GHC.IO.Encoding.UTF8; this is specialised at Addr# but, unlike the above, supports recovery in the presence of partial codepoints (since in IO contexts codepoints may be broken across buffers)
  • the implementation provided by the text library
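To make the "partial codepoint" distinction concrete, here is a minimal sketch of a decoder that reports truncated input separately from malformed input. It works on a plain list of bytes and is not taken from any of the implementations listed above; the names `Result` and `decodeOne` are hypothetical.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

data Result
  = Decoded Char [Word8]  -- one code point plus the unconsumed bytes
  | Partial               -- input ended inside a code point; more bytes needed
  | Invalid               -- malformed sequence
  deriving (Eq, Show)

-- Decode a single code point from the front of the input.
decodeOne :: [Word8] -> Result
decodeOne [] = Partial
decodeOne (b0 : rest)
  | b0 < 0x80                = Decoded (chr (fromIntegral b0)) rest
  | b0 >= 0xC2 && b0 <= 0xDF = continue 1 (fromIntegral b0 .&. 0x1F) 0x80
  | b0 >= 0xE0 && b0 <= 0xEF = continue 2 (fromIntegral b0 .&. 0x0F) 0x800
  | b0 >= 0xF0 && b0 <= 0xF4 = continue 3 (fromIntegral b0 .&. 0x07) 0x10000
  | otherwise                = Invalid
  where
    -- n: continuation bytes still expected; minCp rejects overlong forms.
    continue :: Int -> Int -> Int -> Result
    continue n acc0 minCp = go n acc0 rest
      where
        go 0 acc bs
          | acc < minCp || acc > 0x10FFFF     = Invalid
          | acc >= 0xD800 && acc <= 0xDFFF    = Invalid  -- surrogates
          | otherwise                         = Decoded (chr acc) bs
        go _ _ [] = Partial
        go k acc (b : bs)
          | b .&. 0xC0 == 0x80 =
              go (k - 1) ((acc `shiftL` 6) .|. fromIntegral (b .&. 0x3F)) bs
          | otherwise = Invalid
```

With this sketch, `decodeOne [0xE2, 0x82]` yields `Partial`, so an IO-style caller could hold the two bytes back until the next buffer arrives, whereas a simpler codec would have to treat the same input as an error.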

This does seem a tad silly. On the other hand, these implementations do differ materially from one another (e.g. in the types they support, the detail of the errors they can report, and the ability to recover from partial codepoints). Consequently, it's quite unclear that further consolidation would be worthwhile.

Edited by Ben Gamari
