Changes

autrijus@gmail.com · 1f4ab055
--- a/source-encoding-detection.md
+++ b/source-encoding-detection.md
+# Source Encoding Detection
+## Brief Explanation
+Haskell source code uses the Unicode character set.  However, current implementations either support only one encoding (e.g. UTF-8), or require the encoding to be signified via out-of-band means, which makes Haskell source code non-portable.
+This proposal outlines a detection heuristic that classifies source code as UTF-8, UTF-16 or UTF-32.  A conforming Haskell-prime implementation must accept UTF-8 and UTF-16, and may fail on UTF-32 input.
+This proposal does not cover user-specified source encoding.
+## Details
+The heuristic inspects at most the first 4 bytes of the byte representation of Haskell source code.
+```wiki
+import Data.Word
+data EncodedSource
+    = UTF8 [Word8]
+    | UTF16 Endian [Word8]
+    | UTF32 Endian [Word8]
+ -- | UserDefined ...
+data Endian = LittleEndian | BigEndian
+detectSourceEncoding :: [Word8] -> EncodedSource
+detectSourceEncoding bytes = case bytes of
+    []                          -> UTF8 []
+    [0x00]                      -> invalidNulls
+    xs@[_]                      -> UTF8 xs
+    [0xFF, 0xFE]                -> UTF16 LittleEndian []
+    (0xFE:0xFF:xs)              -> UTF16 BigEndian xs
+    [0x00, 0x00]                -> invalidNulls
+    xs@[0x00, _]                -> UTF16 BigEndian xs
+    xs@[_, 0x00]                -> UTF16 LittleEndian xs
+    xs@[_, _]                   -> UTF8 xs
+    [0x00, 0x00, 0x00]          -> invalidNulls
+    (0xEF:0xBB:0xBF:xs)         -> UTF8 xs
+    xs@[_, _, _]                -> UTF8 xs
+    (0x00:0x00:0xFE:0xFF:xs)    -> UTF32 BigEndian xs
+    (0xFF:0xFE:0x00:0x00:xs)    -> UTF32 LittleEndian xs
+    (0xFF:0xFE:xs)              -> UTF16 LittleEndian xs
+    (0x00:0x00:0x00:0x00:_)     -> invalidNulls
+    xs@(0x00:0x00:0x00:_)       -> UTF32 BigEndian xs
+    xs@(_:0x00:0x00:0x00:_)     -> UTF32 LittleEndian xs
+    (0x00:0x00:_)               -> invalidNulls
+    xs@(0x00:_)                 -> UTF16 BigEndian xs
+    xs@(_:0x00:_)               -> UTF16 LittleEndian xs
+    xs                          -> UTF8 xs
+    where
+    invalidNulls = error "(implementation-specific error message)"
+```
+The heuristic has the following properties:
+- A byte-order mark is optional for all three encodings.
+- If present, a byte-order mark is consumed before lexical analysis.
+- Source code known to begin with the NULL character is disallowed.
+Furthermore, as long as the first logical character in the program is
+under codepoint 0xFF (the "ASCII/Latin1" range), this heuristic can always
+gracefully handle two common classes of text editor flaws:
+- Emitting byte-order mark for UTF-8 text.
+- Omitting byte-order mark for UTF-16 or UTF-32 text.
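+
+To illustrate why the "ASCII/Latin1" condition is enough, here is a minimal sketch of only the BOM-less branch for inputs of at least four bytes. The names `Guess`, `guessNoBOM`, `asciiAs` and `stripUtf8BOM` are hypothetical helpers invented for this example and are not part of the proposal; the encoder assumes pure ASCII input so that one code unit per character suffices.
+```haskell
+import Data.Char (ord)
+import Data.Word (Word8)
+
+-- Hypothetical label type used only in this sketch.
+data Guess = GuessUTF8 | GuessUTF16BE | GuessUTF16LE | GuessUTF32BE | GuessUTF32LE
+    deriving (Eq, Show)
+
+-- BOM-less detection: the position of the zero bytes around the first
+-- character identifies the encoding. A leading NUL character is rejected.
+guessNoBOM :: [Word8] -> Maybe Guess
+guessNoBOM bytes = case bytes of
+    (0x00:0x00:0x00:0x00:_) -> Nothing
+    (0x00:0x00:0x00:_)      -> Just GuessUTF32BE
+    (_:0x00:0x00:0x00:_)    -> Just GuessUTF32LE
+    (0x00:0x00:_)           -> Nothing
+    (0x00:_)                -> Just GuessUTF16BE
+    (_:0x00:_)              -> Just GuessUTF16LE
+    _                       -> Just GuessUTF8
+
+-- Encode an ASCII string without a byte-order mark (assumption: every
+-- character is below U+0080, so each one fits in a single code unit).
+asciiAs :: Guess -> String -> [Word8]
+asciiAs enc = concatMap unit
+  where
+    unit c = case enc of
+        GuessUTF8    -> [b]
+        GuessUTF16BE -> [0, b]
+        GuessUTF16LE -> [b, 0]
+        GuessUTF32BE -> [0, 0, 0, b]
+        GuessUTF32LE -> [b, 0, 0, 0]
+      where b = fromIntegral (ord c)
+
+-- Drop a UTF-8 byte-order mark if an editor emitted one.
+stripUtf8BOM :: [Word8] -> [Word8]
+stripUtf8BOM (0xEF:0xBB:0xBF:xs) = xs
+stripUtf8BOM xs                  = xs
+```
+For example, `guessNoBOM (asciiAs e "module")` evaluates to `Just e` for each of the five labels, and `guessNoBOM (stripUtf8BOM ([0xEF, 0xBB, 0xBF] ++ asciiAs GuessUTF8 "module"))` evaluates to `Just GuessUTF8`. A first character above 0xFF would fill the high byte of its first code unit, and the zero-byte patterns would no longer apply.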
+## Pros
+- Ensures uniform treatment of Unicode in source code.
+## Cons
+- Mandating minimum support for UTF-8 and UTF-16 places an implementation burden on compiler writers.