# The Char type
The Haskell 98 Report ([Characters and Strings](http://www.haskell.org/onlinereport/basic.html#characters)) states that the type `Char` represents [Unicode](unicode), which seems to be the canonical choice.
The functions of the [Char](http://www.haskell.org/onlinereport/char.html) module work with Unicode for GHC and Hugs, with one divergence from the Report:

- `isAlpha` selects Unicode alphabetic characters, not just the union of lower- and upper-case letters.

More sophisticated functions could be provided by additional libraries.
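The `isAlpha` divergence noted above can be seen with a letter that has no case. The following is a minimal sketch, assuming GHC's `Data.Char`; the names `alef` and `isAlphaH98` are introduced here for illustration only, with `isAlphaH98` spelling out the Report's definition as the union of upper- and lower-case letters.

```haskell
module Main where

import Data.Char (isAlpha, isLower, isUpper)

-- Hebrew letter alef (U+05D0): a letter that is neither upper- nor lower-case.
alef :: Char
alef = '\x05D0'

-- Hypothetical helper: the Report's narrower notion of "alphabetic",
-- i.e. the union of upper- and lower-case letters.
isAlphaH98 :: Char -> Bool
isAlphaH98 c = isUpper c || isLower c

main :: IO ()
main = do
  print (isAlpha alef)     -- Unicode-aware predicate as implemented
  print (isAlphaH98 alef)  -- the Report's definition
```

Under GHC the two predicates should disagree on this character, while agreeing on cased letters such as `'a'`.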
## Input and Output
The Haskell 98 [Prelude](http://www.haskell.org/onlinereport/standard-prelude.html#preludeio) and [IO](http://www.haskell.org/onlinereport/io.html) modules provide I/O primitives using the `Char` type.

- All character-based I/O in Hugs and jhc-compiled programs uses the encoding of the current locale.
- Other implementations perform I/O on bytes treated as characters, i.e. as characters belonging to the Latin-1 subset.

Assuming we retain Unicode as the representation of `Char`:

- Flexible handling of character encodings will be needed, but there is no existing implementation. Should we specify it or leave room for experimentation?
- [BinaryIO](binary-io) is needed anyway, and would provide a base for these encodings.
- A simple character-based I/O and system interface like that in Haskell 98, possibly taking defaults from the locale, will also be convenient for many users. However, it might not be in the Prelude if we [shrink the Prelude](prelude).
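To make the idea of encodings layered over binary I/O concrete, here is a rough sketch of two pure encoders from `String` to octets. Nothing here is an existing or proposed API: `encodeCharUtf8`, `encodeUtf8`, and `encodeLatin1` are hypothetical names, and the sketch assumes only the `base` library.

```haskell
module Main where

import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical encoder: one Char to its UTF-8 octets.
-- Surrogate code points are not rejected here; a real library
-- would have to decide how to treat them.
encodeCharUtf8 :: Char -> [Word8]
encodeCharUtf8 c
  | n < 0x80    = [byte n]
  | n < 0x800   = [byte (0xC0 .|. (n `shiftR` 6)), cont n]
  | n < 0x10000 = [byte (0xE0 .|. (n `shiftR` 12)), cont (n `shiftR` 6), cont n]
  | otherwise   = [ byte (0xF0 .|. (n `shiftR` 18))
                  , cont (n `shiftR` 12), cont (n `shiftR` 6), cont n ]
  where
    n = ord c
    byte = fromIntegral :: Int -> Word8
    -- Continuation byte: 10xxxxxx carrying the low six bits of x.
    cont x = byte (0x80 .|. (x .&. 0x3F))

-- Hypothetical encoder for whole strings; total, since UTF-8 covers all of Char.
encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap encodeCharUtf8

-- Hypothetical Latin-1 encoder; fails on any character above U+00FF.
encodeLatin1 :: String -> Maybe [Word8]
encodeLatin1 = mapM enc
  where
    enc c
      | ord c < 0x100 = Just (fromIntegral (ord c))
      | otherwise     = Nothing

main :: IO ()
main = do
  print (encodeUtf8 "caf\xE9")    -- same text, two octets for the last letter
  print (encodeLatin1 "caf\xE9")  -- same text, one octet for the last letter
  print (encodeLatin1 "\x20AC")   -- the euro sign has no Latin-1 encoding
```

The point of the sketch is that the same `String` yields different octet sequences, or none at all, depending on the encoding, which is why the choice cannot be hidden inside a bytes-as-characters interface.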
## Strings in System functions
Native system calls use varying representations of strings:

- Unix-like systems and many others use byte strings, which may use various encodings (or may not be character data at all).
- The NTFS file system (Windows) stores filenames in UTF-16, and the Win32 interface provides functions using UTF-16. Since Windows NT, the byte-level interface is a compatibility layer over UTF-16.

Haskell 98 defines `FilePath` as `String` (used in the [Prelude](http://www.haskell.org/onlinereport/standard-prelude.html#preludeio), [IO](http://www.haskell.org/onlinereport/io.html) and [Directory](http://www.haskell.org/onlinereport/directory.html) modules).
The functions in [System](http://www.haskell.org/onlinereport/system.html) use `String` for program arguments and environment values.

- Hugs exchanges byte-strings using a byte encoding of Unicode determined by the current locale.
- Other implementations treat the byte-strings interchanged with the operating system as characters, i.e. as characters belonging to the Latin-1 subset.
- The [ForeignFunctionInterface](foreign-function-interface) specifies `CString` functions that perform locale-based conversion, but these are not yet provided by the Haskell implementations.

A disadvantage of using encodings is that some byte-strings may not be legal encodings, e.g. using a program argument as a filename may fail. Converting to `String` and back may also lose distinctions for some encodings. On the other hand, byte-strings are inappropriate if the underlying system uses a form of Unicode (e.g. recent Windows, and possibly more systems in the future). One way out would be to provide an abstract type for strings in O/S form. Again, the old character interface would remain useful for many.
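As an illustration of the abstract-type idea, the sketch below wraps a string exactly as the O/S supplied it and makes decoding an explicit, possibly failing step. Everything in it is an assumption made for the example: the names `SysString`, `decodeAscii`, and `decodeLatin1` are hypothetical, the representation shown is the Unix-style octet case only, and ASCII stands in for a locale encoding to keep the decoder short.

```haskell
module Main where

import Data.Char (chr)
import Data.Word (Word8)

-- Hypothetical abstract type: a string in O/S form.
-- Here it wraps octets (the Unix case); a Windows variant
-- might wrap UTF-16 code units instead. A real module would
-- export the type without its constructor.
newtype SysString = SysString [Word8]
  deriving (Eq, Show)

-- Strict decoding under an assumed ASCII locale:
-- fails on any octet that is not a legal ASCII character.
decodeAscii :: SysString -> Maybe String
decodeAscii (SysString bs) = mapM dec bs
  where
    dec b
      | b < 0x80  = Just (chr (fromIntegral b))
      | otherwise = Nothing

-- Latin-1 view: total, since every octet maps to some Char,
-- but the result may not be the text the user intended.
decodeLatin1 :: SysString -> String
decodeLatin1 (SysString bs) = map (chr . fromIntegral) bs

main :: IO ()
main = do
  -- "fo" followed by an octet that is not ASCII.
  let arg = SysString [0x66, 0x6F, 0xFF]
  print (decodeAscii arg)   -- decoding fails, yet arg is still intact
  print (decodeLatin1 arg)  -- always succeeds, byte for byte
```

Because the original value is kept in O/S form, a program argument that fails to decode could still be passed back to the system unchanged, for example as a filename.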
## A Straw-Man Proposal
- **I/O.**
  All raw I/O is in terms of octets, i.e. `Word8`.
- **Conversions.**