# Source Encoding Detection
## Brief Explanation
Haskell source code uses the Unicode character set. However, current implementations either support only one encoding (e.g. UTF-8) or require the encoding to be specified by out-of-band means, which makes Haskell source code containing characters outside the ASCII range non-portable (see [UnicodeInHaskellSource](unicode-in-haskell-source)).

This proposal outlines a detection heuristic that classifies source code as UTF-8, UTF-16 or UTF-32. A conforming Haskell-prime implementation must accept UTF-8 and UTF-16, and may fail on UTF-32 input.

This proposal does not cover user-specified source encoding.

## References

- [Unicode UTF and Byte Order Mark FAQ](http://www.unicode.org/faq/utf_bom.html)

## Details
## Proposal

This heuristic uses at most 4 bytes from the byte representation of Haskell source code.

```wiki
import Data.Word

-- ...
detectSourceEncoding bytes = case bytes of
    -- ...
    (0xEF:0xBB:0xBF:xs)      -> UTF8 xs                 -- UTF-8 byte-order mark
    (0x00:0x00:0xFE:0xFF:xs) -> UTF32 BigEndian xs      -- UTF-32 big-endian byte-order mark
    (0xFF:0xFE:0x00:0x00:xs) -> UTF32 LittleEndian xs   -- UTF-32 little-endian byte-order mark
    (0xFF:0xFE:xs)           -> UTF16 LittleEndian xs   -- UTF-16 little-endian byte-order mark
    (0x00:0x00:0x00:0x00:_)  -> invalidNulls            -- source begins with NULL
    xs@(0x00:0x00:0x00:_)    -> UTF32 BigEndian xs      -- no byte-order mark
    xs@(_:0x00:0x00:0x00:_)  -> UTF32 LittleEndian xs   -- no byte-order mark
    -- ...
```
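
The fragment above omits its type definitions and some of its cases. The following is a minimal, self-contained sketch of one way it might be completed, not the definition from the proposal. The names `Endianness`, `SourceEncoding` and `detectSketch`, the `InvalidNulls` constructor (standing in for `invalidNulls`), the UTF-16 big-endian byte-order mark case, the two BOM-less UTF-16 cases and the final UTF-8 fallback are all assumptions. Like the original, it inspects at most 4 bytes, and it only classifies BOM-less input reliably when the first character is a non-NULL code point in the ASCII/Latin1 range.

```haskell
import Data.Word (Word8)

-- Hypothetical types; the originals are not shown in the fragment.
data Endianness = BigEndian | LittleEndian
  deriving (Eq, Show)

data SourceEncoding
  = UTF8 [Word8]
  | UTF16 Endianness [Word8]
  | UTF32 Endianness [Word8]
  | InvalidNulls              -- stands in for the original's invalidNulls
  deriving (Eq, Show)

-- Classify a byte stream; a byte-order mark, if present, is dropped.
detectSketch :: [Word8] -> SourceEncoding
detectSketch bytes = case bytes of
  (0xEF:0xBB:0xBF:xs)      -> UTF8 xs
  (0x00:0x00:0xFE:0xFF:xs) -> UTF32 BigEndian xs
  (0xFF:0xFE:0x00:0x00:xs) -> UTF32 LittleEndian xs
  (0xFE:0xFF:xs)           -> UTF16 BigEndian xs      -- assumed case
  (0xFF:0xFE:xs)           -> UTF16 LittleEndian xs
  (0x00:0x00:0x00:0x00:_)  -> InvalidNulls
  xs@(0x00:0x00:0x00:_)    -> UTF32 BigEndian xs
  xs@(_:0x00:0x00:0x00:_)  -> UTF32 LittleEndian xs
  xs@(0x00:_)              -> UTF16 BigEndian xs      -- assumed case
  xs@(_:0x00:_)            -> UTF16 LittleEndian xs   -- assumed case
  xs                       -> UTF8 xs                 -- assumed fallback
```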

The heuristic has the following properties:

- The byte-order mark is optional in all three encodings.
- If present, byte-order marks are consumed before lexical analysis.
- Source code known to begin with the NULL character is disallowed.

Furthermore, as long as the first logical character in the program is
below codepoint 0xFF (the "ASCII/Latin1" range), this heuristic can always
gracefully handle two common classes of text editor flaws:

- Emitting a byte-order mark for UTF-8 text.
- Omitting the byte-order mark for UTF-16 or UTF-32 text.
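
To illustrate why a missing byte-order mark can still be tolerated, the sketch below encodes a single first character in each layout and records where the zero bytes fall. It is only an illustration under simplifying assumptions: the names `Layout`, `encodeAscii`, `nullMask` and `masks` are hypothetical, and the character is restricted to non-NULL code points below 0x80 so that its UTF-8 form is a single byte.

```haskell
import Data.Word (Word8)

-- Hypothetical names, used only for this illustration.
data Layout = Utf8 | Utf16BE | Utf16LE | Utf32BE | Utf32LE
  deriving (Eq, Show, Enum, Bounded)

-- Encode one code point below 0x80, without a byte-order mark.
encodeAscii :: Layout -> Word8 -> [Word8]
encodeAscii Utf8    c = [c]
encodeAscii Utf16BE c = [0, c]
encodeAscii Utf16LE c = [c, 0]
encodeAscii Utf32BE c = [0, 0, 0, c]
encodeAscii Utf32LE c = [c, 0, 0, 0]

-- Positions of zero bytes within the encoded first character.
nullMask :: Layout -> Word8 -> [Bool]
nullMask layout = map (== 0) . encodeAscii layout

-- The masks for 'm' (0x6D), e.g. the first letter of "module".
masks :: [(Layout, [Bool])]
masks = [ (l, nullMask l 0x6D) | l <- [minBound .. maxBound] ]
```

Each layout yields a different pattern of zero bytes, which is the property the BOM-less cases of the heuristic rely on.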
## Pros
- Ensures uniform treatment of Unicode in source code.
- Disallows implicit ISO-8859-\* encodings in source code, ensuring portability.
## Cons

- Mandating a minimum support for UTF-8/UTF-16 places an implementation burden on compiler writers.
- Existing code relying on a non-UTF8, locale-/implementation-specific encoding will need conversion.
  (At present, this is mostly Latin-1.)