| ... | ... | @@ -8,22 +8,27 @@ |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Haskell source code uses the Unicode character set. However, current implementations either support only one encoding (e.g. UTF-8), or require the encoding to be signified via out-of-band means, which makes Haskell source code outside ASCII range non-portable.
|
|
|
|
Haskell source code uses the Unicode character set. However, current implementations either support only one encoding (e.g. UTF-8), or require the encoding to be signified via out-of-band means, which makes Haskell source code outside the ASCII range non-portable (see [UnicodeInHaskellSource](unicode-in-haskell-source)).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
This proposal outlines a detection heuristics that categorizes the source code as under UTF-8, UTF-16 or UTF-32. A conforming Haskell-prime implementation must accept UTF-8 and UTF-16, and may fail on UTF-32 input.
|
|
|
|
This proposal outlines a detection heuristic that categorizes the source code as under UTF-8, UTF-16 or UTF-32. A conforming Haskell-prime implementation must accept UTF-8 and UTF-16, and may fail on UTF-32 input.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
This proposal does not cover user-specified source encoding.
|
|
|
|
|
|
|
|
|
|
|
|
## References
|
|
|
|
|
|
|
|
|
|
|
|
- [Unicode UTF and Byte Order Mark FAQ](http://www.unicode.org/faq/utf_bom.html)
|
|
|
|
|
|
|
|
## Proposal
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
This heuristics uses at most 4 bytes from the byte representation of Haskell source code.
|
|
|
|
This heuristic uses at most 4 bytes from the byte representation of Haskell source code.
|
|
|
|
|
|
|
|
|
|
|
|
```wiki
|
| ... | ... | @@ -66,17 +71,17 @@ detectSourceEncoding bytes = case bytes of |
|
|
|
```
|
|
|
|
|
|
|
|
|
|
|
|
The heuristics has the following properties:
|
|
|
|
The heuristic has the following properties:
|
|
|
|
|
|
|
|
|
|
|
|
- Byte-order mark is optional on all three encodings.
|
|
|
|
- If present, byte-order marks are consumed before lexical analysis.
|
|
|
|
- Source code known to begin with the NULL chracter is disallowed.
|
|
|
|
- Source code known to begin with the NULL character is disallowed.
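
The properties above can be illustrated with a small sketch. It is not the proposal's `detectSourceEncoding` (whose body is not shown in this diff); the names `Endian`, `Encoding` and `sniffEncoding` are hypothetical, and the ordering of the cases is an assumption chosen so that at most 4 bytes are inspected, an optional byte-order mark is stripped, and input known to begin with NULL is rejected.

```haskell
import Data.Word (Word8)

-- Hypothetical types for illustration only.
data Endian = BigEndian | LittleEndian deriving (Eq, Show)
data Encoding = UTF8 | UTF16 Endian | UTF32 Endian deriving (Eq, Show)

-- Guess the encoding from at most the first 4 bytes.
-- Returns the guess plus the bytes with any byte-order mark removed,
-- or Nothing when the input is known to begin with NULL.
sniffEncoding :: [Word8] -> Maybe (Encoding, [Word8])
sniffEncoding bytes = case bytes of
  -- Byte-order marks. The UTF-32 LE mark must be tried before the
  -- UTF-16 LE mark because both start with FF FE; reading FF FE 00 00
  -- as UTF-32 is safe only because a leading NULL is disallowed.
  (0x00:0x00:0xFE:0xFF:rest) -> Just (UTF32 BigEndian, rest)
  (0xFF:0xFE:0x00:0x00:rest) -> Just (UTF32 LittleEndian, rest)
  (0xFE:0xFF:rest)           -> Just (UTF16 BigEndian, rest)
  (0xFF:0xFE:rest)           -> Just (UTF16 LittleEndian, rest)
  (0xEF:0xBB:0xBF:rest)      -> Just (UTF8, rest)
  -- No mark: assume the first character is below U+0100 and look at
  -- where the zero bytes fall.
  (0x00:0x00:0x00:0x00:_)    -> Nothing                        -- NULL in UTF-32
  (0x00:0x00:0x00:_:_)       -> Just (UTF32 BigEndian, bytes)
  (_:0x00:0x00:0x00:_)       -> Just (UTF32 LittleEndian, bytes)
  (0x00:0x00:_)              -> Nothing                        -- NULL in UTF-16
  (0x00:_:_)                 -> Just (UTF16 BigEndian, bytes)
  (_:0x00:_)                 -> Just (UTF16 LittleEndian, bytes)
  [0x00]                     -> Nothing                        -- NULL in UTF-8
  _                          -> Just (UTF8, bytes)
```

For instance, under these assumptions `sniffEncoding [0xEF,0xBB,0xBF,0x6D]` reports UTF-8 and drops the three mark bytes, while `sniffEncoding [0x00,0x6D]` reports big-endian UTF-16 and leaves the bytes untouched. A first character at or above U+0100 without a byte-order mark may be misclassified by this sketch.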
|
|
|
|
|
|
|
|
|
|
|
|
Furthermore, as long as the first logical character in the program is
|
|
|
|
under codepoint 0xFF (the "ASCII/Latin1" range), this heuristics can always
|
|
|
|
gracefully handle two common class of text editor flaws:
|
|
|
|
under codepoint 0xFF (the "ASCII/Latin1" range), this heuristic can always
|
|
|
|
gracefully handle two common classes of text editor flaws:
|
|
|
|
|
|
|
|
|
|
|
|
- Emitting byte-order mark for UTF-8 text.
|
| ... | ... | @@ -93,3 +98,4 @@ gracefully handle two common class of text editor flaws: |
|
|
|
|
|
|
|
- Mandating minimum support for UTF-8/UTF-16 places an implementation burden on compiler writers.
|
|
|
|
- Existing code relying on a non-UTF-8, locale-/implementation-specific encoding will need conversion.
|
|
|
|
(At present, this is mostly Latin-1.)