Changes

ross@soi.city.ac.uk · ce1dd571
--- a/source-encoding-detection.md
+++ b/source-encoding-detection.md
@@ -8,22 +8,27 @@



-Haskell source code uses the Unicode character set.  However, current implementations either support only one encoding (e.g. UTF-8), or require the encoding to be signified via out-of-band means, which makes Haskell source code outside ASCII range non-portable.
+Haskell source code uses the Unicode character set.  However, current implementations either support only one encoding (e.g. UTF-8), or require the encoding to be signified via out-of-band means, which makes Haskell source code outside the ASCII range non-portable (see [UnicodeInHaskellSource](unicode-in-haskell-source)).



-This proposal outlines a detection heuristics that categorizes the source code as under UTF-8, UTF-16 or UTF-32.  A conforming Haskell-prime implementation must accept UTF-8 and UTF-16, and may fail on UTF-32 input.
+This proposal outlines a detection heuristic that categorizes the source code as under UTF-8, UTF-16 or UTF-32.  A conforming Haskell-prime implementation must accept UTF-8 and UTF-16, and may fail on UTF-32 input.



 This proposal does not cover user-specified source encoding.


+## References
+
+
+- [Unicode UTF and Byte Order Mark FAQ](http://www.unicode.org/faq/utf_bom.html)
+
 ## Proposal



-This heuristics uses at most 4 bytes from the byte representation of Haskell source code.
+This heuristic uses at most 4 bytes from the byte representation of Haskell source code.


 ```wiki
@@ -66,17 +71,17 @@ detectSourceEncoding bytes = case bytes of
 ```


-The heuristics has the following properties:
+The heuristic has the following properties:


 - Byte-order mark is optional on all three encodings.
 - If present, byte-order-marks are consumed before lexical analysis.
-- Source code known to begin with the NULL chracter is disallowed.
+- Source code known to begin with the NULL character is disallowed.


 Furthermore, as long as the first logical character in the program is
-under codepoint 0xFF (the "ASCII/Latin1" range), this heuristics can always
-gracefully handle two common class of text editor flaws:
+under codepoint 0xFF (the "ASCII/Latin1" range), this heuristic can always
+gracefully handle two common classes of text editor flaws:


 - Emitting byte-order mark for UTF-8 text.
@@ -93,3 +98,4 @@ gracefully handle two common class of text editor flaws:

 - Mandating a minimum support for UTF-8/UTF-16 places an implementation burden on compiler writers.
 - Existing code relying on a non-UTF8, locale-/implementation-specific encoding will need conversion.
+  (At present, this is mostly Latin-1.)
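
For readers of this diff, who see the body of `detectSourceEncoding` only in elided form, the following is a minimal sketch of how a detector over at most 4 bytes might look. It is not the proposal's own code: the `Encoding` type, the `guessEncoding` name and the exact pattern order are assumptions made for illustration. It assumes the first logical character lies below codepoint 0xFF and relies on the rule that source code may not begin with the NULL character.

```haskell
module GuessEncoding (Encoding (..), guessEncoding) where

import Data.Word (Word8)

-- Hypothetical result type; the names are illustrative, not from the proposal.
data Encoding = UTF8 | UTF16BE | UTF16LE | UTF32BE | UTF32LE
  deriving (Eq, Show)

-- | Guess the encoding from at most the first four bytes of the input.
-- Returns the guess and the number of byte-order-mark bytes to skip
-- before lexical analysis.
guessEncoding :: [Word8] -> (Encoding, Int)
guessEncoding bytes = case take 4 bytes of
  -- Explicit byte-order marks. The UTF-32 patterns come first because
  -- FF FE 00 00 also begins with the UTF-16 LE mark; reading it as UTF-32
  -- is safe only because a leading NULL character is disallowed.
  [0x00, 0x00, 0xFE, 0xFF] -> (UTF32BE, 4)
  [0xFF, 0xFE, 0x00, 0x00] -> (UTF32LE, 4)
  (0xFE : 0xFF : _)        -> (UTF16BE, 2)
  (0xFF : 0xFE : _)        -> (UTF16LE, 2)
  (0xEF : 0xBB : 0xBF : _) -> (UTF8, 3)
  -- No mark: a first character in the ASCII/Latin1 range leaves
  -- zero bytes whose position reveals the width and byte order.
  [0x00, 0x00, 0x00, _]    -> (UTF32BE, 0)
  [_, 0x00, 0x00, 0x00]    -> (UTF32LE, 0)
  (0x00 : _ : _)           -> (UTF16BE, 0)
  (_ : 0x00 : _)           -> (UTF16LE, 0)
  -- Anything else, including empty input, is treated as UTF-8.
  _                        -> (UTF8, 0)
```

Under these assumptions, a file saved as UTF-8 with a byte-order mark is still classified as UTF-8 with three bytes to skip, and a UTF-16 file saved without a mark is recognised from the zero byte next to its first character.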