# Source Encoding Detection

## Brief Explanation
Haskell source code uses the Unicode character set. However, current implementations either support only one encoding (e.g. UTF-8) or require the encoding to be specified through out-of-band means, which makes Haskell source code non-portable.

This proposal outlines a detection heuristic that classifies source code as UTF-8, UTF-16 or UTF-32. A conforming Haskell-prime implementation must accept UTF-8 and UTF-16, and may fail on UTF-32 input.

This proposal does not cover user-specified source encoding.
## Details

The heuristic inspects at most the first 4 bytes of the byte representation of Haskell source code.
```haskell
import Data.Word (Word8)

-- Source bytes tagged with the detected encoding.
-- A leading byte-order mark, if any, is not part of the payload.
data EncodedSource
    = UTF8 [Word8]
    | UTF16 Endian [Word8]
    | UTF32 Endian [Word8]
    -- | UserDefined ...

data Endian = LittleEndian | BigEndian

detectSourceEncoding :: [Word8] -> EncodedSource
detectSourceEncoding bytes = case bytes of
    -- Inputs shorter than four bytes.
    []                          -> UTF8 []
    [0x00]                      -> invalidNulls
    xs@[_]                      -> UTF8 xs
    [0xFF, 0xFE]                -> UTF16 LittleEndian []
    (0xFE:0xFF:xs)              -> UTF16 BigEndian xs
    [0x00, 0x00]                -> invalidNulls
    xs@[0x00, _]                -> UTF16 BigEndian xs
    xs@[_, 0x00]                -> UTF16 LittleEndian xs
    xs@[_, _]                   -> UTF8 xs
    [0x00, 0x00, 0x00]          -> invalidNulls
    -- UTF-8 BOM; matched before the three-byte catch-all so that a
    -- file consisting only of the BOM has it consumed as well.
    (0xEF:0xBB:0xBF:xs)         -> UTF8 xs
    xs@[_, _, _]                -> UTF8 xs
    -- UTF-32 BOMs come before the UTF-16 little-endian BOM, which is
    -- a prefix of the UTF-32 little-endian one.
    (0x00:0x00:0xFE:0xFF:xs)    -> UTF32 BigEndian xs
    (0xFF:0xFE:0x00:0x00:xs)    -> UTF32 LittleEndian xs
    (0xFF:0xFE:xs)              -> UTF16 LittleEndian xs
    -- No BOM: infer the encoding from the position of null bytes.
    (0x00:0x00:0x00:0x00:_)     -> invalidNulls
    xs@(0x00:0x00:0x00:_)       -> UTF32 BigEndian xs
    xs@(_:0x00:0x00:0x00:_)     -> UTF32 LittleEndian xs
    (0x00:0x00:_)               -> invalidNulls
    xs@(0x00:_)                 -> UTF16 BigEndian xs
    xs@(_:0x00:_)               -> UTF16 LittleEndian xs
    xs                          -> UTF8 xs
  where
    invalidNulls = error "(implementation-specific error message)"
```
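
Once the encoding is known, the payload still has to be grouped into code units before it can be decoded. The following is a minimal sketch of that next step for the two UTF-16 cases, using only `base`. The helper names `utf16UnitsBE`, `utf16UnitsLE` and `unit` are hypothetical and not part of the proposal, and silently dropping a trailing odd byte is a simplification that a real implementation would probably report as an error.

```haskell
import Data.Bits (shiftL, (.|.))
import Data.Word (Word16, Word8)

-- Hypothetical helper: combine a high and a low byte into one code unit.
unit :: Word8 -> Word8 -> Word16
unit hi lo = (fromIntegral hi `shiftL` 8) .|. fromIntegral lo

-- Group a big-endian UTF-16 payload into 16-bit code units.
-- Assumption: a trailing odd byte is dropped rather than reported.
utf16UnitsBE :: [Word8] -> [Word16]
utf16UnitsBE (hi:lo:rest) = unit hi lo : utf16UnitsBE rest
utf16UnitsBE _            = []

-- Same as above for a little-endian payload (low byte first).
utf16UnitsLE :: [Word8] -> [Word16]
utf16UnitsLE (lo:hi:rest) = unit hi lo : utf16UnitsLE rest
utf16UnitsLE _            = []
```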
The heuristic has the following properties:

- The byte-order mark is optional for all three encodings.
- If present, the byte-order mark is consumed before lexical analysis.
- Source code known to begin with the NULL character is disallowed.
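
The byte-order-mark behaviour listed above can be sketched in isolation. The table `boms` and the helper `stripBom` are illustrative names, not part of the proposal. The sketch only recognises and removes a leading mark and makes no attempt at the null-byte cases.

```haskell
import Data.List (stripPrefix)
import Data.Word (Word8)

-- Byte-order marks, with the four-byte marks first so that the
-- UTF-32LE mark (FF FE 00 00) is tried before the UTF-16LE mark (FF FE).
boms :: [(String, [Word8])]
boms =
  [ ("UTF-32BE", [0x00, 0x00, 0xFE, 0xFF])
  , ("UTF-32LE", [0xFF, 0xFE, 0x00, 0x00])
  , ("UTF-8",    [0xEF, 0xBB, 0xBF])
  , ("UTF-16BE", [0xFE, 0xFF])
  , ("UTF-16LE", [0xFF, 0xFE])
  ]

-- Hypothetical helper: if the input starts with a known mark, return the
-- encoding name and the input with the mark consumed; otherwise Nothing,
-- since the mark is optional.
stripBom :: [Word8] -> Maybe (String, [Word8])
stripBom bytes =
  case [ (name, rest)
       | (name, bom) <- boms
       , Just rest <- [stripPrefix bom bytes]
       ] of
    (hit:_) -> Just hit
    []      -> Nothing
```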
Furthermore, as long as the first logical character of the program is under code point 0xFF (the "ASCII/Latin1" range), the heuristic always handles two common classes of text editor flaws gracefully:

- Emitting a byte-order mark for UTF-8 text.
- Omitting the byte-order mark for UTF-16 or UTF-32 text.
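
To see why the restriction to the Latin-1 range matters for UTF-16 text without a byte-order mark, it helps to look at the bytes themselves. The sketch below uses hypothetical helpers (`utf16le`, `secondByteIsNull`) and assumes characters in the Basic Multilingual Plane. A first character in the Latin-1 range leaves a zero high byte that a pattern such as `(_:0x00:_)` can key on, whereas a first character such as U+03BB does not, so a file starting with it would be classified as UTF-8.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode one character as UTF-16 little-endian, without a BOM.
-- Assumption: the code point is below 0x10000 (no surrogate pairs).
utf16le :: Char -> [Word8]
utf16le c =
  [ fromIntegral (n .&. 0xFF)                -- low byte first
  , fromIntegral ((n `shiftR` 8) .&. 0xFF)   -- then the high byte
  ]
  where
    n = ord c

-- True when the second byte is 0x00, which is the clue a BOM-less
-- little-endian UTF-16 detector can rely on.
secondByteIsNull :: [Word8] -> Bool
secondByteIsNull (_:0x00:_) = True
secondByteIsNull _          = False

-- utf16le 'm'      == [0x6D, 0x00]   (high byte is zero)
-- utf16le '\x3BB'  == [0xBB, 0x03]   (high byte is non-zero)
```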
## Pros

- Ensures uniform treatment of Unicode in source code.

## Cons

- Mandating minimum support for UTF-8 and UTF-16 places an implementation burden on compiler writers.