| ... | ... | @@ -49,20 +49,27 @@ Some things we could do: |
|
|
|
|
|
|
|
If Unicode is allowed, should its use be restricted, e.g. to character and string literals?
|
|
|
|
|
|
|
|
## Unicode in Haskell IO
|
|
|
|
## The Char type
|
|
|
|
|
|
|
|
- All character-based IO in jhc-compiled programs is carried out in the current locale of the system.
|
|
|
|
- In nhc and GHC, character-based IO is carried out as if it were Latin-1.
|
|
|
|
|
|
|
|
## The Char type
|
|
|
|
The Haskell 98 Report claims that the type `Char` represents Unicode, which seems to be the canonical choice.
|
|
|
|
The functions of `Char` work with Unicode for GHC and Hugs, with one divergence from the Report:
|
|
|
|
|
|
|
|
- `isAlpha` selects Unicode alphabetic characters, not just the union of lower- and upper-case letters.
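
The divergence can be observed with a small check. This is a minimal sketch, assuming the `Data.Char` module of GHC's `base` library; the helper name `caselessLetter` is hypothetical and only serves the illustration.

```haskell
import Data.Char (isAlpha, isLower, isUpper)

-- Hypothetical helper: True for characters that isAlpha accepts
-- although they are neither upper-case nor lower-case letters.
caselessLetter :: Char -> Bool
caselessLetter c = isAlpha c && not (isUpper c || isLower c)

-- Sample characters: a CJK ideograph (U+4E2D), the Hebrew letter alef
-- (U+05D0), an ASCII letter, and a digit.
samples :: [Char]
samples = ['\x4E2D', '\x05D0', 'a', '1']

-- The subset of samples that are alphabetic but caseless.
caselessSamples :: [Char]
caselessSamples = filter caselessLetter samples
```

Under the Report's definition, where `isAlpha` is the union of `isUpper` and `isLower`, no character could satisfy `caselessLetter`.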
|
|
|
|
|
|
|
|
## I/O and System functions
|
|
|
|
|
|
|
|
|
|
|
|
The Report goes on to provide I/O primitives using the `Char` type, define `FilePath` as `String`, and have the functions in `System` use `String`.
|
|
|
|
|
|
|
|
The Haskell 98 Report claims that the type `Char` represents Unicode. It goes on to provide I/O primitives using the `Char` type, define `FilePath` as `[Char]`, etc. Most implementations treat the octets exchanged with the operating system (file contents, filenames, program arguments and the environment) as characters, i.e. as belonging to the Latin-1 subset. Hugs treats them as a byte encoding of Unicode determined by the current locale, with the disadvantage that some byte strings may not be legal encodings.
|
|
|
|
- Hugs treats the bytes exchanged with the operating system (I/O, filenames, program arguments and the environment) as a byte encoding of Unicode determined by the current locale, with the disadvantage that some byte strings may not be legal encodings.
|
|
|
|
- All character-based I/O in jhc-compiled programs uses the encoding of the current locale. Handling of strings will be similar when the CString functions become conformant.
|
|
|
|
- Other implementations treat the bytes exchanged with the operating system as characters, i.e. as belonging to the Latin-1 subset.
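
The difference between the two treatments can be sketched in pure Haskell. This is an illustration only, not the code of any implementation named above: `decodeLatin1` and `decodeUtf8` are hypothetical names, and UTF-8 stands in for "the encoding of the current locale", which is an assumption.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Latin-1 view: each byte becomes the character with the same code
-- point, so decoding can never fail.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- UTF-8 view (assumed locale encoding): Nothing when the bytes are not
-- a legal encoding.
decodeUtf8 :: [Word8] -> Maybe String
decodeUtf8 [] = Just []
decodeUtf8 (b : bs)
  | b < 0x80 = (chr (fromIntegral b) :) <$> decodeUtf8 bs
  | b .&. 0xE0 == 0xC0 = multi 1 (fromIntegral (b .&. 0x1F)) 0x80
  | b .&. 0xF0 == 0xE0 = multi 2 (fromIntegral (b .&. 0x0F)) 0x800
  | b .&. 0xF8 == 0xF0 = multi 3 (fromIntegral (b .&. 0x07)) 0x10000
  | otherwise = Nothing
  where
    -- n continuation bytes, payload bits of the lead byte, and the
    -- smallest code point allowed for this length (rejects overlong forms).
    multi :: Int -> Int -> Int -> Maybe String
    multi n lead minCp =
      let (conts, rest) = splitAt n bs
          isCont c = c .&. 0xC0 == 0x80
          cp = foldl (\acc c -> (acc `shiftL` 6) .|. fromIntegral (c .&. 0x3F)) lead conts
          valid = length conts == n && all isCont conts
                    && cp >= minCp && cp <= 0x10FFFF
                    && not (cp >= 0xD800 && cp <= 0xDFFF)
      in if valid then (chr cp :) <$> decodeUtf8 rest else Nothing
```

For the bytes `[0xC3, 0xA9]`, the Latin-1 view yields two characters while the UTF-8 view yields the single character U+00E9. The lone byte `[0xE9]` is accepted by the Latin-1 view but rejected by the UTF-8 view, which is the "not legal encodings" disadvantage noted above.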
|
|
|
|
|
|
|
|
|
|
|
|
Using Unicode for `Char` seems the principled thing to do. If we retain it:
|
|
|
|
Assuming we retain Unicode as the representation of `Char`:
|
|
|
|
|
|
|
|
- Flexible handling of character encodings is needed, but not necessarily as part of this standard.
|
|
|
|
- [BinaryIO](binary-io) is needed anyway, and would provide a base for these encodings.
|
|
|
|
- A simple character-based I/O interface like that in Haskell 98, possibly taking defaults from the locale, will also be convenient for many users.
|
|
|
|
- A simple character-based I/O and system interface like that in Haskell 98, possibly taking defaults from the locale, will also be convenient for many users.
|
|
|
|
- An abstract type may be needed for data in O/S form, such as filenames, program arguments and the environment. |
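
One possible shape for such an abstract type is sketched below. Everything here is hypothetical (the name `OsData` and all four functions); it is not an existing library API. In a real module the constructor would be left out of the export list so that the type stays abstract.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical abstract type: the bytes exactly as the O/S supplied them.
newtype OsData = OsData [Word8]
  deriving (Eq, Show)

-- Wrap raw bytes received from the O/S.
fromBytes :: [Word8] -> OsData
fromBytes = OsData

-- Recover the raw bytes unchanged, e.g. to pass a filename back to the O/S.
toBytes :: OsData -> [Word8]
toBytes (OsData bs) = bs

-- View the bytes as characters under a Latin-1 assumption; never fails.
toLatin1String :: OsData -> String
toLatin1String (OsData bs) = map (chr . fromIntegral) bs

-- Convert a String back; fails if any character lies outside Latin-1.
fromLatin1String :: String -> Maybe OsData
fromLatin1String s
  | all ((<= 0xFF) . ord) s = Just (OsData (map (fromIntegral . ord) s))
  | otherwise = Nothing
```

Keeping the bytes opaque means a filename that is not a legal encoding in the current locale can still be passed back to the O/S unchanged, while conversion to `String` becomes an explicit step that may fail.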