Changes

autrijus@gmail.com · 403628a4
--- a/source-encoding-detection.md
+++ b/source-encoding-detection.md
+
+
+
 # Source Encoding Detection

+
 ## Brief Explanation


-Haskell source code uses the Unicode character set.  However, current implementations either support only one encoding (e.g. UTF-8), or require the encoding to be signified via out-of-band means, which makes Haskell source code outside the ASCII range non-portable (see [UnicodeInHaskellSource](unicode-in-haskell-source)).
+
+Haskell source code uses the Unicode character set.  However, current implementations either support only one encoding (e.g. UTF-8), or require the encoding to be signified via out-of-band means, which makes Haskell source code non-portable.
+


-This proposal outlines a detection heuristic that categorizes the source code as under UTF-8, UTF-16 or UTF-32.  A conforming Haskell-prime implementation must accept UTF-8 and UTF-16, and may fail on UTF-32 input.
+This proposal outlines a detection heuristic that classifies source code as UTF-8, UTF-16 or UTF-32.  A conforming Haskell-prime implementation must accept UTF-8 and UTF-16, and may fail on UTF-32 input.
+


 This proposal does not cover user-specified source encoding.

-## References

- [ Unicode UTF and Byte Order Mark FAQ](http://www.unicode.org/faq/utf_bom.html)
+## Details
+

-## Proposal

+This heuristic inspects at most the first 4 bytes of the byte representation of Haskell source code.

-This heuristic uses at most 4 bytes from the byte representation of Haskell source code.

 ```wiki
 import Data.Word
@@ -47,7 +53,7 @@ detectSourceEncoding bytes = case bytes of
    (0xEF:0xBB:0xBF:xs)         -> UTF8 xs
    (0x00:0x00:0xFE:0xFF:xs)    -> UTF32 BigEndian xs
    (0xFF:0xFE:0x00:0x00:xs)    -> UTF32 LittleEndian xs
-    (0xFF:0xFE:xs)              -> UTF16 LittleEndian xs
+    (0xFF:0xFE:xs)              -> UTF16 LittleEndian xs
    (0x00:0x00:0x00:0x00:_)     -> invalidNulls
    xs@(0x00:0x00:0x00:_)       -> UTF32 BigEndian xs
    xs@(_:0x00:0x00:0x00:_)     -> UTF32 LittleEndian xs
@@ -60,27 +66,28 @@ detectSourceEncoding bytes = case bytes of
 ```


-The heuristic has the following properties:
+The heuristic has the following properties:
+

 - Byte-order mark is optional on all three encodings.
 - If present, byte-order-marks are consumed before lexical analysis.
- Source code known to begin with the NULL character is disallowed.
+- Source code known to begin with the NULL character is disallowed.


 Furthermore, as long as the first logical character in the program is
-under codepoint 0xFF (the "ASCII/Latin1" range), this heuristic can always
-gracefully handle two common classes of text editor flaws:
+under codepoint 0xFF (the "ASCII/Latin1" range), this heuristic can always
+gracefully handle two common classes of text editor flaws:
+

 - Emitting byte-order mark for UTF-8 text.
 - Omitting byte-order mark for UTF-16 or UTF-32 text.
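
The two cases above can be illustrated with a minimal, self-contained sketch. It is not the proposal's full `detectSourceEncoding`: the hunk shown earlier elides part of that function, so the type names `Endian` and `Detected`, the function name `detect`, and every pattern not visible in the hunk (the UTF-16 big-endian byte-order mark and the BOM-less UTF-16 and UTF-8 fallbacks) are assumptions made for illustration.

```haskell
import Data.Word (Word8)

-- Hypothetical result types; the proposal's own definitions are not shown in the hunk.
data Endian = BigEndian | LittleEndian
  deriving (Eq, Show)

data Detected
  = UTF8 [Word8]
  | UTF16 Endian [Word8]
  | UTF32 Endian [Word8]
  | InvalidNulls
  deriving (Eq, Show)

-- Classify a byte stream by looking at no more than its first 4 bytes.
detect :: [Word8] -> Detected
detect bytes = case bytes of
  -- Byte-order marks: recognised and consumed.
  (0xEF:0xBB:0xBF:xs)      -> UTF8 xs
  (0x00:0x00:0xFE:0xFF:xs) -> UTF32 BigEndian xs
  (0xFF:0xFE:0x00:0x00:xs) -> UTF32 LittleEndian xs
  (0xFE:0xFF:xs)           -> UTF16 BigEndian xs     -- assumed; not visible in the hunk
  (0xFF:0xFE:xs)           -> UTF16 LittleEndian xs
  -- No byte-order mark: infer the encoding from the zero padding
  -- around a first character below codepoint 0xFF.
  (0x00:0x00:0x00:0x00:_)  -> InvalidNulls
  xs@(0x00:0x00:0x00:_)    -> UTF32 BigEndian xs
  xs@(_:0x00:0x00:0x00:_)  -> UTF32 LittleEndian xs
  (0x00:0x00:_)            -> InvalidNulls           -- assumed
  xs@(0x00:_)              -> UTF16 BigEndian xs     -- assumed
  xs@(_:0x00:_)            -> UTF16 LittleEndian xs  -- assumed
  xs                       -> UTF8 xs                -- assumed fallback

main :: IO ()
main = do
  -- "m" as UTF-8 with a byte-order mark (first editor flaw).
  print (detect [0xEF, 0xBB, 0xBF, 0x6D])
  -- "ma" as UTF-16 little-endian without a byte-order mark (second editor flaw).
  print (detect [0x6D, 0x00, 0x61, 0x00])
  -- "m" as UTF-32 big-endian without a byte-order mark.
  print (detect [0x00, 0x00, 0x00, 0x6D])
```

The order of the patterns matters: the UTF-32 little-endian byte-order mark begins with the same two bytes as the UTF-16 little-endian one, so the longer pattern has to be tried first.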

 ## Pros

+
 - Ensures uniform treatment of Unicode in source code.
- Disallows implicit ISO-8859-\* encodings in source code, ensuring portability.

 ## Cons

+
 - Mandating a minimum support for UTF-8/UTF-16 places an implementation burden on compiler writers.
- Existing code relying on a non-UTF8, locale-/implementation-specific encoding will need conversion.
-  (At present, this is mostly Latin-1.)