Changes

ross@soi.city.ac.uk · d2e86b13
--- a/unicode.md
+++ b/unicode.md
@@ -49,20 +49,27 @@ Some things we could do:

 If Unicode is allowed, should its use be restricted, e.g. to character and string literals?

-## Unicode in Haskell IO
+## The Char type

- All character based IO in jhc compiled programs is carried out in the current locale of the system
- in nhc and ghc, character based IO is carried out as if it were latin1.

-## The Char type
+The Haskell 98 Report claims that the type `Char` represents Unicode, which seems to be the canonical choice.
+The functions of `Char` work with Unicode for GHC and Hugs, with one divergence from the Report:
+
+- `isAlpha` selects Unicode alphabetic characters, not just the union of lower- and upper-case letters.
+
+## I/O and System functions
+

+The Report goes on to provide I/O primitives using the `Char` type, define `FilePath` as `String`, and have the functions in `System` use `String`.

-The Haskell 98 Report claims that the type `Char` represents Unicode. It goes on to provide I/O primitives using the `Char` type, define `FilePath` as `[Char]`, etc. Most implementations treat the octets interchanged with the operation system (file contents, filenames, program arguments and the environment) as characters, i.e. belonging to the Latin-1 subset. Hugs treats them as using a byte encoding of Unicode determined by the current locale, with the disadvantage that some byte strings may not be legal encodings.
+- Hugs treats the bytes interchanged with the operating system (I/O, filenames, program arguments and the environment) as using a byte encoding of Unicode determined by the current locale, with the disadvantage that some byte strings may not be legal encodings.
+- All character-based I/O in jhc-compiled programs uses the encoding of the current locale. Handling of strings will be similar when the CString functions become conformant.
+- Other implementations treat the bytes interchanged with the operating system as characters, i.e. belonging to the Latin-1 subset.


-Using Unicode for `Char` seems the principled thing to do. If we retain it:
+Assuming we retain Unicode as the representation of `Char`:

 - Flexible handling of character encodings is needed, but not necessarily as part of this standard.
 - [BinaryIO](binary-io) is needed anyway, and would provide a base for these encodings.
- A simple character-based I/O interface like that in Haskell 98, possibly taking defaults from the locale, will also be convenient for many users.
+- A simple character-based I/O and system interface like that in Haskell 98, possibly taking defaults from the locale, will also be convenient for many users.
 - An abstract type may be needed for data in O/S form, such as filenames, program arguments and the environment.
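
As an illustration of the two treatments of bytes described above, here is a minimal sketch that decodes the same byte string in both ways. The names `decodeLatin1` and `decodeUtf8` are hypothetical helpers written for this note, not functions from the Report or from any implementation, and UTF-8 merely stands in for "a locale-determined encoding"; the decoder is simplified and is not a complete validator.

```haskell
import Data.Char (chr)
import Data.Word (Word8)

-- Latin-1 treatment: each byte becomes the character with the same code point.
-- This never fails, but multi-byte encodings come out as several characters.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- Locale-style treatment, assuming the locale encoding is UTF-8.
-- Returns Nothing when the bytes are not a legal encoding.
decodeUtf8 :: [Word8] -> Maybe String
decodeUtf8 [] = Just []
decodeUtf8 (b : bs)
  | b < 0x80               = (chr (fromIntegral b) :) <$> decodeUtf8 bs
  | b >= 0xC2 && b <= 0xDF = multi 1 (fromIntegral b - 0xC0) 0x80
  | b >= 0xE0 && b <= 0xEF = multi 2 (fromIntegral b - 0xE0) 0x800
  | b >= 0xF0 && b <= 0xF4 = multi 3 (fromIntegral b - 0xF0) 0x10000
  | otherwise              = Nothing
  where
    -- Consume n continuation bytes and fold their low six bits into the code point.
    multi n lead minCp =
      let (cont, rest) = splitAt n bs
          cp = foldl (\acc c -> acc * 64 + (fromIntegral c - 0x80)) lead cont
      in if length cont == n && all isCont cont && valid minCp cp
           then (chr cp :) <$> decodeUtf8 rest
           else Nothing
    isCont c = c >= 0x80 && c <= 0xBF
    -- Reject overlong forms, surrogates and values beyond U+10FFFF.
    valid minCp cp =
      cp >= minCp && cp <= 0x10FFFF && not (cp >= 0xD800 && cp <= 0xDFFF)
```

With these definitions, the bytes `[0xC3, 0xA9]` decode to the two characters `"\195\169"` under `decodeLatin1` but to `Just "\233"` (a single é) under `decodeUtf8`, while the lone byte `[0xE9]` is a valid Latin-1 é yet yields `Nothing` from `decodeUtf8`. The second case is the "not a legal encoding" disadvantage noted for the locale-based approach.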