Source Encoding Detection
Brief Explanation
Haskell source code uses the Unicode character set. However, current implementations either support only one encoding (e.g. UTF-8) or require the encoding to be indicated by out-of-band means, which makes Haskell source code containing non-ASCII characters non-portable (see UnicodeInHaskellSource).
This proposal outlines a detection heuristic that classifies source code as UTF-8, UTF-16 or UTF-32. A conforming Haskell-prime implementation must accept UTF-8 and UTF-16, and may reject UTF-32 input.
This proposal does not cover user-specified source encoding.
Proposal
This heuristic inspects at most the first 4 bytes of the byte representation of Haskell source code.
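import Data.Word (Word8)

-- Source bytes tagged with the detected encoding. Any byte-order mark
-- (BOM) has already been removed from the payload.
data EncodedSource
    = UTF8 [Word8]
    | UTF16 Endian [Word8]
    | UTF32 Endian [Word8]
    -- | UserDefined ...

data Endian = LittleEndian | BigEndian

detectSourceEncoding :: [Word8] -> EncodedSource
detectSourceEncoding bytes = case bytes of
    -- Inputs of zero or one byte.
    []                       -> UTF8 []
    [0x00]                   -> invalidNulls
    xs@[_]                   -> UTF8 xs
    -- Two-byte inputs, plus the UTF-16 big-endian BOM at any length.
    [0xFF, 0xFE]             -> UTF16 LittleEndian []
    (0xFE:0xFF:xs)           -> UTF16 BigEndian xs
    [0x00, 0x00]             -> invalidNulls
    xs@[0x00, _]             -> UTF16 BigEndian xs
    xs@[_, 0x00]             -> UTF16 LittleEndian xs
    xs@[_, _]                -> UTF8 xs
    -- UTF-8 BOM. Tested before the three-byte fallback so that a file
    -- consisting only of the BOM also has it consumed.
    (0xEF:0xBB:0xBF:xs)      -> UTF8 xs
    -- Remaining three-byte inputs.
    [0x00, 0x00, 0x00]       -> invalidNulls
    xs@[_, _, _]             -> UTF8 xs
    -- Remaining BOMs. UTF-32 LE precedes UTF-16 LE because FF FE is a
    -- prefix of FF FE 00 00.
    (0x00:0x00:0xFE:0xFF:xs) -> UTF32 BigEndian xs
    (0xFF:0xFE:0x00:0x00:xs) -> UTF32 LittleEndian xs
    (0xFF:0xFE:xs)           -> UTF16 LittleEndian xs
    -- No BOM: infer the encoding from where the NUL bytes fall.
    (0x00:0x00:0x00:0x00:_)  -> invalidNulls
    xs@(0x00:0x00:0x00:_)    -> UTF32 BigEndian xs
    xs@(_:0x00:0x00:0x00:_)  -> UTF32 LittleEndian xs
    (0x00:0x00:_)            -> invalidNulls
    xs@(0x00:_)              -> UTF16 BigEndian xs
    xs@(_:0x00:_)            -> UTF16 LittleEndian xs
    xs                       -> UTF8 xs
  where
    invalidNulls = error "(implementation-specific error message)"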
The heuristic has the following properties:
- The byte-order mark is optional in all three encodings.
- If present, byte-order marks are consumed before lexical analysis.
- Source code known to begin with the NUL character (U+0000) is disallowed.
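The first two properties can be illustrated in isolation with a small BOM-stripping helper. This is only a sketch: the names Bom, boms and stripBom are hypothetical and not part of the proposal, and the helper covers the byte-order-mark cases only, not the NUL-based inference.

```haskell
import Data.List (stripPrefix)
import Data.Maybe (mapMaybe)
import Data.Word (Word8)

-- Hypothetical names, used for this illustration only.
data Bom = Utf8Bom | Utf16BE | Utf16LE | Utf32BE | Utf32LE
  deriving (Eq, Show)

-- BOM signatures in priority order. The UTF-32 LE mark must be tried
-- before the UTF-16 LE mark because FF FE is a prefix of FF FE 00 00.
boms :: [(Bom, [Word8])]
boms =
  [ (Utf32BE, [0x00, 0x00, 0xFE, 0xFF])
  , (Utf32LE, [0xFF, 0xFE, 0x00, 0x00])
  , (Utf8Bom, [0xEF, 0xBB, 0xBF])
  , (Utf16BE, [0xFE, 0xFF])
  , (Utf16LE, [0xFF, 0xFE])
  ]

-- Split off a leading byte-order mark if one is present; otherwise
-- return the input unchanged.
stripBom :: [Word8] -> (Maybe Bom, [Word8])
stripBom bytes =
  case mapMaybe try boms of
    (hit : _) -> hit
    []        -> (Nothing, bytes)
  where
    try (bom, sig) =
      fmap (\rest -> (Just bom, rest)) (stripPrefix sig bytes)
```

As in the heuristic above, a file starting with FF FE 00 00 is read as UTF-32 little-endian rather than as UTF-16 little-endian text whose first character is NUL, which is consistent with NUL-initial source being disallowed.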
Furthermore, as long as the first logical character in the program is below code point 0xFF (the "ASCII/Latin1" range), this heuristic gracefully handles two common classes of text editor flaws:
- Emitting a byte-order mark for UTF-8 text.
- Omitting the byte-order mark for UTF-16 or UTF-32 text.
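To see why a missing byte-order mark is recoverable, note that a first character in this range occupies a single non-NUL byte in UTF-16 and UTF-32, padded with NUL bytes whose position reveals the code unit width and the byte order. The following minimal sketch demonstrates this. The names Enc, encodeChar and guess are hypothetical, encodeChar assumes code points no higher than 0xFF, and guess covers only inputs of at least four bytes.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical names, used for this illustration only.
data Enc = U8 | U16BE | U16LE | U32BE | U32LE
  deriving (Eq, Show, Enum, Bounded)

-- Encode one character without a BOM.
-- Assumption: the code point is at most 0xFF.
encodeChar :: Enc -> Char -> [Word8]
encodeChar enc ch = case enc of
    U8 | c < 0x80  -> [lo]
       | otherwise -> [ 0xC0 .|. fromIntegral (c `shiftR` 6)
                      , 0x80 .|. (lo .&. 0x3F) ]
    U16BE -> [0, lo]
    U16LE -> [lo, 0]
    U32BE -> [0, 0, 0, lo]
    U32LE -> [lo, 0, 0, 0]
  where
    c  = ord ch
    lo = fromIntegral c :: Word8

-- Guess the encoding of BOM-less input from the NUL pattern in its
-- first four bytes. Nothing means "rejected" or "too short for this
-- sketch".
guess :: [Word8] -> Maybe Enc
guess bytes = case take 4 bytes of
    [0, 0, 0, 0] -> Nothing
    [0, 0, 0, _] -> Just U32BE
    [_, 0, 0, 0] -> Just U32LE
    [0, 0, _, _] -> Nothing
    [0, _, _, _] -> Just U16BE
    [_, 0, _, _] -> Just U16LE
    [_, _, _, _] -> Just U8
    _            -> Nothing
```

For example, encoding the text "module" in each of the five forms and passing the result to guess recovers the encoding that was used.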
Pros
- Ensures uniform treatment of Unicode in source code.
- Disallows implicit ISO-8859-* encodings in source code, ensuring portability.
Cons
- Mandating minimum support for UTF-8 and UTF-16 places an implementation burden on compiler writers.
- Existing code relying on a non-UTF-8, locale- or implementation-specific encoding will need conversion. (At present, this is mostly Latin-1.)