# Unicode
The Haskell 98 Report claims the language uses Unicode (aka ISO 10646-1). Most of the rest of the world uses something else, or at least some particular encoding, and the Report is silent on how this gap is to be bridged. There are still no implementations that comply fully with Unicode.
## Background on Unicode
- [Unicode](http://www.unicode.org) (or equivalently ISO 10646-1) defines a finite set of abstract characters and assigns them code points in the range 0x0 to 0x10ffff (i.e. a 21-bit quantity).
- Characters are distinguished from glyphs: the presentation of a character may vary with style, locale, etc., and some glyphs may correspond to a sequence of characters (e.g. base character and combining mark characters).
- Unicode has a story for the display of mixed left-to-right and right-to-left scripts (the [BiDi algorithm](http://www.unicode.org/reports/tr9/)).
- The first 128 code points match US-ASCII; the first 256 code points match ISO 8859-1 (Latin-1).
- For reasons of backwards compatibility and space efficiency, there are a variety of *variable-length* encodings of the code points themselves into byte streams.
- Any system must be able to read/write files that originated on any other platform.
- As an example of the complex heuristics needed to guess the encoding of any particular file, see the [XML standard](http://www.w3.org/TR/2004/REC-xml11-20040204/#sec-guessing).
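
As a rough illustration of the kind of guessing involved, the following Haskell sketch looks only for a byte order mark (BOM) at the start of a file. The `Guess` type and the `guessByBOM` function are hypothetical names invented for this example. A real detector has to consider more cases than this, including files that carry no mark at all.

```haskell
import Data.Word (Word8)

-- | Hypothetical result type for this sketch; not part of any library.
data Guess = UTF8WithBOM | UTF16BE | UTF16LE | UTF32BE | UTF32LE | Unknown
  deriving (Eq, Show)

-- | Inspect the leading octets of a file for a byte order mark.
-- The UTF-32 patterns come first because the UTF-32LE mark begins with
-- the same two octets as the UTF-16LE mark.
guessByBOM :: [Word8] -> Guess
guessByBOM (0x00 : 0x00 : 0xFE : 0xFF : _) = UTF32BE
guessByBOM (0xFF : 0xFE : 0x00 : 0x00 : _) = UTF32LE
guessByBOM (0xEF : 0xBB : 0xBF : _)        = UTF8WithBOM
guessByBOM (0xFE : 0xFF : _)               = UTF16BE
guessByBOM (0xFF : 0xFE : _)               = UTF16LE
guessByBOM _                               = Unknown
```

A plain ASCII or Latin-1 file yields `Unknown` here, which is exactly the case where further heuristics (or an explicit declaration) would be needed.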
## Unicode in Haskell source
The Haskell 98 Report claims that Haskell source code uses the Unicode character set.
If Unicode were allowed, how would implementations know which encoding was used?

- Jhc is the only implementation that allows unrestricted use of the Unicode character set in Haskell source, treating input as UTF-8.
- Hugs treats input as being in the encoding specified by the current locale, but permits Unicode only in comments and character and string literals.
- Others treat source code as Latin-1.
- Jhc supports several uses of Unicode characters instead of the Haskell keywords:
- \[chr 0x2192\] '→' is equivalent to '->'
- \[chr 0x2190\] '←' is equivalent to '<-'
- \[chr 0x2237\] '∷' is equivalent to '::'
- \[chr 0x2025\] '‥' is equivalent to '..'
- \[chr 0x21d2\] '⇒' is equivalent to '=>'
- \[chr 0x2200\] '∀' is equivalent to 'forall'
- \[chr 0x2203\] '∃' is equivalent to 'exists' — future extension will use
- In addition, there is experimental support for defining new operators and names using various Unicode characters.

Some things we could do:
- Revert to US-ASCII, Latin-1 or implementation-defined character sets.
- Allow Unicode, defining a portable form (the \\uNNNN escapes in Haskell 1.4 were an attempt at this).
- Allow Unicode, with a mechanism for specifying encoding in the source file.
- Allow Unicode with the encoding specified by the current locale (as currently done by Hugs). This is arguably the correct thing for all programs that read text files, but it makes Haskell source using non-ASCII characters non-portable. (We could specify that all compilers must support UTF-8 and/or some other portable form too.)

If Unicode is allowed, should its use be restricted, e.g. to character and string literals?

What about supporting scripts where characters are written right-to-left instead of left-to-right? In theory, you still have a simple sequence of characters, but what about mixing L-to-R and R-to-L scripts, e.g. the spelling of language keywords? Should 'module' be spelled 'eludom' in an R-to-L encoding? Or is this just a text-editor/visualisation problem that does not concern the language standard? (Note that, for instance in Arabic script, words are R-to-L but numbers are L-to-R.)

Unicode has varying-width space characters (m-width, n-width, l-width, non-breaking space, narrow non-breaking space, zero-width non-breaking space, …). How do these interact with the layout rules?
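
To make the layout question a little more concrete, here is a small Haskell sketch that asks `Data.Char` (as shipped with GHC's base library) how it classifies a few of these characters. The table `spaceLike` and the helper `classify` are names made up for this example, and the selection of code points is only a sample, not a complete list.

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isSpace)

-- | A sample of space-like code points with informal names.
spaceLike :: [(String, Char)]
spaceLike =
  [ ("space",                     '\x0020')
  , ("no-break space",            '\x00A0')
  , ("en space",                  '\x2002')
  , ("em space",                  '\x2003')
  , ("thin space",                '\x2009')
  , ("narrow no-break space",     '\x202F')
  , ("zero width no-break space", '\xFEFF')
  ]

-- | Pair each name with what Data.Char reports for that character.
classify :: [(String, Bool, GeneralCategory)]
classify = [ (name, isSpace c, generalCategory c) | (name, c) <- spaceLike ]
```

In GHC, `isSpace` holds for every entry except the zero width no-break space, whose general category is `Format` rather than `Space`. A layout rule phrased in terms of "white space" would therefore have to say which of these count, and how wide each one is taken to be.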
## The Char type
The Haskell 98 Report claims that the type `Char` represents Unicode, which seems to be the canonical choice.
The functions of the `Char` library work with Unicode in GHC and Hugs, with one divergence from the Report:
- `isAlpha` selects Unicode alphabetic characters, not just the union of lower- and upper-case letters.
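
The divergence can be observed directly. In the sketch below, `isAlphaReport` is a hypothetical helper that spells out the Report's definition (a letter is anything lower-case or upper-case), and `disagreements` lists the characters on which it differs from the `isAlpha` actually provided by `Data.Char`.

```haskell
import Data.Char (isAlpha, isLower, isUpper)

-- | The Report's definition: alphabetic means lower-case or upper-case.
isAlphaReport :: Char -> Bool
isAlphaReport c = isLower c || isUpper c

-- | Characters on which the library's 'isAlpha' and the Report's
-- definition give different answers.
disagreements :: [Char] -> [Char]
disagreements = filter (\c -> isAlpha c /= isAlphaReport c)
```

For example, in GHC the Hebrew letter alef (`'\x05D0'`) and the CJK ideograph `'\x4E2D'` are alphabetic but have no case, so both show up in `disagreements`, while `'a'`, `'Z'` and `'1'` do not.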
## Input and Output
Haskell 98 provides I/O primitives using the `Char` type.
- All character-based I/O in Hugs and jhc-compiled programs uses the encoding of the current locale.
- Other implementations perform I/O on bytes treated as characters, i.e. belonging to the Latin-1 subset.

Assuming we retain Unicode as the representation of `Char`:

- Flexible handling of character encodings will be needed, but there is no existing implementation. Should we specify it or leave room for experimentation?
- [BinaryIO](binary-io) is needed anyway, and would provide a base for these encodings.
- A simple character-based I/O and system interface like that in Haskell 98, possibly taking defaults from the locale, will also be convenient for many users. However, it might not be in the Prelude if we [shrink the Prelude](prelude).
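
To see what "bytes treated as characters" means in practice, the sketch below models it as a pure function and applies it to the UTF-8 encoding of a single accented letter. The names `bytesAsChars`, `eAcuteUtf8` and `seenCodePoints` are invented for this illustration and do not correspond to any implementation's internals.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Treat each byte as the character with the same code point, i.e. as
-- a member of the Latin-1 subset of Unicode.
bytesAsChars :: [Word8] -> String
bytesAsChars = map (chr . fromIntegral)

-- | The UTF-8 encoding of U+00E9 (e with acute accent) is two bytes.
eAcuteUtf8 :: [Word8]
eAcuteUtf8 = [0xC3, 0xA9]

-- | Code points seen by a program that reads those bytes as characters.
seenCodePoints :: [Int]
seenCodePoints = map ord (bytesAsChars eAcuteUtf8)
```

The program sees the two characters U+00C3 and U+00A9 rather than the single character U+00E9, so any `Char` function applied to the input operates on the wrong characters unless the data really is Latin-1.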
## Strings in System functions
Native system calls use varying representations of strings:
- Unix-like systems and many others use byte strings, which may use various encodings (or may not be character data at all).
- The NTFS file system (Windows) stores filenames in UTF-16, and the Win32 interface provides functions using UTF-16. Since Windows NT, the byte-level interface is a compatibility layer over UTF-16.

Haskell 98 defines `FilePath` as `String`; the functions in `System` use `String` for program arguments and environment values.

- Hugs exchanges byte-strings using a byte encoding of Unicode determined by the current locale.
- Other implementations treat the byte-strings interchanged with the operating system as characters, i.e. belonging to the Latin-1 subset.
- The [ForeignFunctionInterface](foreign-function-interface) specifies `CString` functions that perform locale-based conversion, but these are not yet provided by the Haskell implementations.

A disadvantage of using encodings is that some byte-strings may not be legal encodings, e.g. using a program argument as a filename may fail. Converting to `String` and back may also lose distinctions for some encodings. On the other hand, byte-strings are inappropriate if the underlying system uses a form of Unicode (e.g. recent Windows, and possibly more systems in the future). One way out would be to provide an abstract type for strings in O/S form. Again, the old character interface would remain useful for many.
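
One possible shape for such an abstract type is sketched below, assuming a Unix-like system where the native form is a byte string. Everything here (`OSString`, `toLatin1`, `fromLatin1`) is hypothetical and only meant to show the idea: the native bytes are kept untouched, and conversion to and from `String` is an explicit step that can fail.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Hypothetical abstract type for a string in the operating system's
-- own form; here a byte string, as on Unix-like systems.
newtype OSString = OSString [Word8]
  deriving (Eq, Show)

-- | View the bytes as a String, one Latin-1 code point per byte.
-- This never fails and never loses information.
toLatin1 :: OSString -> String
toLatin1 (OSString bs) = map (chr . fromIntegral) bs

-- | Inverse of 'toLatin1'; fails on characters outside Latin-1.
fromLatin1 :: String -> Maybe OSString
fromLatin1 = fmap OSString . mapM toByte
  where
    toByte c
      | ord c <= 0xFF = Just (fromIntegral (ord c))
      | otherwise     = Nothing
```

With this arrangement a program argument can be passed on as a filename without ever being decoded, so it survives even when it is not a legal encoding of any text. On a system whose native form is UTF-16, the representation inside `OSString` would differ, which is the reason for keeping the type abstract.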
## A Straw-Man Proposal
- **Internal character representation.**
  The Haskell type `Char` is UCS-4.
- **Haskell source encoding.**
  - Introduce a pragma `{-# ENCODING e #-}` with a range of possible values of the encoding `e`. If the pragma is present, it must be at the beginning of the file. If it is not present, the file is encoded in ASCII. Note that even if the pragma is present, some heuristic may be needed even to get as far as interpreting the encoding declaration, like in [XML](http://www.w3.org/TR/2004/REC-xml11-20040204/#sec-guessing). The fact that the first three characters must be `{-#` will be useful here. Haskell compilers must support at least the encodings ASCII, LATIN1, and UTF8.
  - A literal string may contain any literal character representable in the source encoding. In addition, escapes are provided to permit the specification of *any* Unicode character (which may or may not be otherwise representable in the source encoding).
  - An identifier may contain any Unicode alphanumeric or symbol characters from a defined range. Thus, a source text may not be representable in certain other encodings (especially in ASCII).
- **I/O.**
  All raw I/O is in terms of octets, i.e. `Word8`.
- **Conversions.**
  Pure functions exist to convert octets to and from any particular encoding:

  ```wiki
  stringDecode :: Encoding -> [Word8] -> [Char]
  stringEncode :: Encoding -> [Char] -> [Word8]
  ```

  The codecs must operate on strings, not individual characters, because some encodings use variable-length sequences of octets.
- **Efficiency.**
  Semantically, character-based I/O is a simple composition of the raw I/O primitives with an encoding conversion function. However, for efficiency, an implementation might choose to provide certain encoded I/O operations primitively. If such primitives are exposed to the user, they should have standard names so that other implementations can provide the same functionality in pure Haskell Prime.
- **Locales.**
  It may be possible to retain the traditional I/O signatures for `hGetChar`, `hPutChar`, `readFile`, `writeFile`, etc., but only by introducing a stateful notion of *current encoding* associated with each individual handle. The default encoding could be inherited from the operating system environment, but it should also be possible to change the encoding explicitly.

  ```wiki
  getIOEncoding   :: Handle -> IO Encoding
  setIOEncoding   :: Encoding -> Handle -> IO ()
  resetIOEncoding :: Handle -> IO ()   -- go back to default
  ```
- **Filenames, program arguments, environment.**
  - Filenames are stored in Haskell as `[Char]`, but the operating system should receive `[Word8]` for any I/O using filenames. Some encoding conversion is therefore required. Usually, this will be platform-dependent, and so the actual encoding may be hidden from the programmer as part of the default locale.
  - Program arguments, and symbols from the environment, are supplied by the operating system to the Haskell program as `[Word8]`. The program is responsible for conversion to `[Char]`. Again, there may be a default encoding chosen based on the locale.
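
To give a feel for the conversion functions in the proposal, here is a minimal sketch of `stringEncode` and `stringDecode` for the three encodings that compilers would be required to support. The proposal does not say what should happen to unrepresentable or malformed input. This sketch assumes that encoding writes `?` and decoding yields U+FFFD, which is a choice made purely for illustration. The helper names (`narrow`, `encodeUtf8`, `decodeUtf8`, `toChar`, `replacement`) are likewise invented here, and the encoder does not treat surrogate code points specially.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | The three encodings the proposal requires every compiler to support.
data Encoding = ASCII | Latin1 | UTF8
  deriving (Eq, Show)

-- | Stand-in for undecodable input (an assumption of this sketch).
replacement :: Char
replacement = '\xFFFD'

stringEncode :: Encoding -> [Char] -> [Word8]
stringEncode ASCII  = map (narrow 0x7F)
stringEncode Latin1 = map (narrow 0xFF)
stringEncode UTF8   = concatMap encodeUtf8

-- | Single-octet encodings: write '?' for characters above the limit.
narrow :: Int -> Char -> Word8
narrow limit c
  | ord c <= limit = fromIntegral (ord c)
  | otherwise      = 0x3F

-- | UTF-8 uses one to four octets per code point.
encodeUtf8 :: Char -> [Word8]
encodeUtf8 c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6,  lo 0]
  | n < 0x10000 = [0xE0 .|. hi 12, lo 6, lo 0]
  | otherwise   = [0xF0 .|. hi 18, lo 12, lo 6, lo 0]
  where
    n    = ord c
    hi s = fromIntegral (n `shiftR` s)
    lo s = 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F)

stringDecode :: Encoding -> [Word8] -> [Char]
stringDecode ASCII  = map (\b -> if b < 0x80 then toChar b else replacement)
stringDecode Latin1 = map toChar
stringDecode UTF8   = decodeUtf8

toChar :: Word8 -> Char
toChar = chr . fromIntegral

decodeUtf8 :: [Word8] -> [Char]
decodeUtf8 [] = []
decodeUtf8 (b : bs)
  | b < 0x80           = toChar b : decodeUtf8 bs
  | b .&. 0xE0 == 0xC0 = multi 1 0x80    (b .&. 0x1F)
  | b .&. 0xF0 == 0xE0 = multi 2 0x800   (b .&. 0x0F)
  | b .&. 0xF8 == 0xF0 = multi 3 0x10000 (b .&. 0x07)
  | otherwise          = replacement : decodeUtf8 bs
  where
    -- Read k continuation octets; truncated, overlong, surrogate and
    -- out-of-range sequences produce the replacement character.
    multi k smallest lead
      | length cont == k && all isCont cont && valid = chr n : decodeUtf8 rest
      | otherwise = replacement : decodeUtf8 bs
      where
        (cont, rest) = splitAt k bs
        isCont x = x .&. 0xC0 == 0x80
        n     = foldl (\acc x -> acc `shiftL` 6 .|. fromIntegral (x .&. 0x3F))
                      (fromIntegral lead) cont
        valid = n >= smallest && n <= 0x10FFFF && (n < 0xD800 || n > 0xDFFF)
```

The UTF-8 case shows why the codecs work on whole strings: one `Char` may become up to four octets, and the decoder has to look ahead across octet boundaries. Character-based reading is then, semantically, just `stringDecode enc` applied to the result of whatever raw octet-reading primitive the implementation provides.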

See also [UnicodeInHaskellSource](unicode-in-haskell-source) and [CharAsUnicode](char-as-unicode).