Assuming we retain Unicode as the representation of `Char`:

- [BinaryIO](binary-io) is needed anyway, and would provide a base for these encodings.
- A simple character-based I/O and system interface like that in Haskell 98, possibly taking defaults from the locale, will also be convenient for many users.
- An abstract type may be needed for data in O/S form, such as filenames, program arguments and the environment.

## A Straw-Man Proposal

- **Internal character representation.**
  The Haskell type `Char` is UCS-4.
- **Haskell source encoding.**

  - Introduce a pragma `{-# ENCODING e #-}` with a range of possible
    values of the encoding `e`. If the pragma is present, it must be
    at the beginning of the file. If it is not present, the file is
    encoded in Latin-1. Note that even if the pragma is present, some
    heuristic may be needed even to get as far as interpreting the
    encoding declaration, as in
    [XML](http://www.w3.org/TR/2004/REC-xml11-20040204/#sec-guessing).
    The fact that the first three characters must be `{-#` will be
    useful here.
  - A literal string may contain any literal character representable in the
    source encoding. In addition, escapes are provided to permit the specification of
    *any* Unicode character (which may or may not be otherwise
    representable in the source encoding).
  - An identifier may contain any Unicode alphanumeric or symbol
    characters from a defined range. Thus, a source text may not be
    representable in certain other encodings (especially in ASCII).
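
The heuristic for locating the `ENCODING` pragma can be sketched in the style of the XML autodetection rules: since a file carrying the pragma must start with `{-#` (octets `0x7B 0x2D 0x23` in any ASCII-compatible encoding), the first few octets are enough to pick a coarse encoding family in which to read the rest of the pragma. The `Family` type and `guessFamily` function are hypothetical names introduced only for this illustration, and byte-order marks are deliberately ignored.

```haskell
import Data.Word (Word8)

-- | Coarse encoding families that can be told apart from the first
-- octets of a file. Illustrative only; not part of the proposal.
data Family
  = AsciiCompatible  -- ASCII, Latin-1, UTF-8, ...: one octet per pragma character
  | Utf16BE
  | Utf16LE
  | Ucs4BE
  | Ucs4LE
  | NoPragma         -- file does not start with "{-#": use the Latin-1 default
  deriving (Eq, Show)

-- | Guess the family from the leading octets, relying on the rule that
-- a pragma, if present, starts the file with the characters "{-#".
guessFamily :: [Word8] -> Family
guessFamily octets = case octets of
  0x00 : 0x00 : 0x00 : 0x7B : _ -> Ucs4BE
  0x7B : 0x00 : 0x00 : 0x00 : _ -> Ucs4LE
  0x00 : 0x7B : 0x00 : 0x2D : _ -> Utf16BE
  0x7B : 0x00 : 0x2D : 0x00 : _ -> Utf16LE
  0x7B : 0x2D : 0x23 : _        -> AsciiCompatible
  _                             -> NoPragma
```

Once the family is known, the pragma text itself can be decoded and the precise encoding `e` read from it.
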
- **I/O.**
  All raw I/O is in terms of octets, i.e. `Word8`.
- **Conversions.**
  Pure functions exist to convert octets to and from any particular encoding:

  ```wiki
  stringDecode :: Encoding -> [Word8] -> [Char]
  stringEncode :: Encoding -> [Char] -> [Word8]
  ```

  The codecs must operate on strings, not individual characters, because some
  encodings use variable-length sequences of octets.
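
A minimal sketch of the two conversion functions, assuming a hypothetical two-constructor `Encoding` type (the proposal leaves `Encoding` abstract). The UTF-8 case shows why the codecs work on whole strings: one character maps to between one and four octets. Malformed input is replaced with U+FFFD, and overlong forms are not rejected; both are simplifications chosen for this sketch.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Hypothetical encoding tag; a real design would cover many more.
data Encoding = Latin1 | UTF8
  deriving (Eq, Show)

stringEncode :: Encoding -> [Char] -> [Word8]
stringEncode Latin1 = map latin1
  where
    -- Characters outside Latin-1 become '?' in this sketch.
    latin1 c
      | ord c < 0x100 = fromIntegral (ord c)
      | otherwise     = 0x3F
stringEncode UTF8 = concatMap (map fromIntegral . utf8 . ord)
  where
    utf8 n
      | n < 0x80    = [n]
      | n < 0x800   = [0xC0 .|. (n `shiftR` 6), cont n]
      | n < 0x10000 = [0xE0 .|. (n `shiftR` 12), cont (n `shiftR` 6), cont n]
      | otherwise   = [ 0xF0 .|. (n `shiftR` 18), cont (n `shiftR` 12)
                      , cont (n `shiftR` 6), cont n ]
    -- Continuation octet: 10xxxxxx carrying the low six bits.
    cont m = 0x80 .|. (m .&. 0x3F)

stringDecode :: Encoding -> [Word8] -> [Char]
stringDecode Latin1 = map (chr . fromIntegral)
stringDecode UTF8   = go . map fromIntegral
  where
    go [] = []
    go (b : bs)
      | b < 0x80              = chr b : go bs
      | b >= 0xC0 && b < 0xE0 = multi 1 (b .&. 0x1F) bs
      | b >= 0xE0 && b < 0xF0 = multi 2 (b .&. 0x0F) bs
      | b >= 0xF0 && b < 0xF8 = multi 3 (b .&. 0x07) bs
      | otherwise             = '\xFFFD' : go bs
    -- Consume n continuation octets, taking six bits from each.
    multi n acc bs
      | length cs == n && all isCont cs = safeChr (foldl step acc cs) : go rest
      | otherwise                       = '\xFFFD' : go bs
      where
        (cs, rest) = splitAt n bs
        step a c   = (a `shiftL` 6) .|. (c .&. 0x3F)
    isCont c = c .&. 0xC0 == 0x80
    -- Guard against values beyond the Unicode range.
    safeChr n
      | n > 0x10FFFF = '\xFFFD'
      | otherwise    = chr n
```

With these definitions, `stringEncode UTF8 "é"` yields the two octets `0xC3 0xA9`, while `stringEncode Latin1 "é"` yields the single octet `0xE9`.
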
- **Efficiency.**
  Semantically, character-based I/O is a simple composition of the raw
  I/O primitives with an encoding conversion function. However, for
  efficiency, an implementation might choose to provide certain encoded
  I/O operations primitively. If such primitives are exposed to the
  user, they should have standard names so that other implementations can
  provide the same functionality in pure Haskell Prime.
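
The "simple composition" can be sketched as follows, with raw octet I/O borrowed from the `bytestring` package as a stand-in for the proposal's binary I/O layer. The names `readOctets`, `writeOctets`, `readFileWith` and `writeFileWith` are hypothetical, and the Latin-1 codec is included only to make the example self-contained.

```haskell
import qualified Data.ByteString as BS
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Raw I/O in terms of octets.
readOctets :: FilePath -> IO [Word8]
readOctets path = BS.unpack <$> BS.readFile path

writeOctets :: FilePath -> [Word8] -> IO ()
writeOctets path = BS.writeFile path . BS.pack

-- Character I/O is raw I/O composed with a pure codec.
readFileWith :: ([Word8] -> [Char]) -> FilePath -> IO [Char]
readFileWith decode path = decode <$> readOctets path

writeFileWith :: ([Char] -> [Word8]) -> FilePath -> [Char] -> IO ()
writeFileWith encode path = writeOctets path . encode

-- A trivial Latin-1 codec to plug in.
latin1Decode :: [Word8] -> [Char]
latin1Decode = map (chr . fromIntegral)

latin1Encode :: [Char] -> [Word8]
latin1Encode = map (fromIntegral . ord)  -- truncates code points above 0xFF
```

An implementation that provides, say, a primitive Latin-1 `readFile` for speed would need to behave exactly like `readFileWith latin1Decode`.
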
- **Locales.**
  It may be possible to retain the traditional I/O signatures for
  `hGetChar`, `hPutChar`, `readFile`, `writeFile`, etc., but only by introducing
  a stateful notion of *current encoding* associated with each
  individual handle. The default encoding could be inherited from the
  operating system environment, but it should also be possible to
  change the encoding explicitly.

  ```wiki
  getIOEncoding   :: Handle -> IO Encoding
  setIOEncoding   :: Encoding -> Handle -> IO ()
  resetIOEncoding :: Handle -> IO ()   -- go back to default
  ```
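
For comparison, GHC's `System.IO` offers a per-handle encoding of this kind through `hGetEncoding`, `hSetEncoding` and `localeEncoding`, although the names and argument order differ from the signatures proposed here. The sketch below uses that existing API; `withEncoding` and `resetToLocale` are hypothetical helpers, and treating the locale encoding as "the default" is an assumption.

```haskell
import Control.Exception (finally)
import System.IO

-- | Run an action with the handle's encoding temporarily changed,
-- restoring the previous state afterwards (even on exceptions).
withEncoding :: TextEncoding -> Handle -> IO a -> IO a
withEncoding enc h act = do
  old <- hGetEncoding h            -- Nothing means the handle is in binary mode
  hSetEncoding h enc
  act `finally` restore old
  where
    restore (Just e) = hSetEncoding h e
    restore Nothing  = hSetBinaryMode h True

-- | Rough analogue of resetIOEncoding, assuming the default is the
-- encoding inherited from the locale.
resetToLocale :: Handle -> IO ()
resetToLocale h = hSetEncoding h localeEncoding
```

A call such as `withEncoding utf8 h (hPutStr h text)` then writes `text` as UTF-8 regardless of the locale.
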
- **Filenames, program arguments, environment.**

  - Filenames are stored in Haskell as `[Char]`, but the operating
    system should receive `[Word8]` for any I/O using filenames.
    Some encoding conversion is therefore required. Usually, this will
    be platform-dependent, and so the actual encoding may be hidden
    from the programmer as part of the default locale.
  - Program arguments, and symbols from the environment, are supplied
    by the operating system to the Haskell program as `[Word8]`.
    The program is responsible for conversion to `[Char]`. Again,
    there may be a default encoding chosen based on the locale.
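
One way to picture the abstract type for data in O/S form is a wrapper around the raw octets that can only be converted to or from `[Char]` by naming a codec. This is a sketch under assumptions: `OsData`, `Codec`, `toOs`, `fromOs` and `latin1Codec` are hypothetical names, and a real module would hide the `OsData` constructor and supply a locale-derived default codec.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Data exactly as the operating system supplies or expects it.
newtype OsData = OsData [Word8]
  deriving (Eq, Show)

-- | A pair of pure conversions, standing in for the proposal's Encoding.
data Codec = Codec
  { encodeWith :: [Char] -> [Word8]
  , decodeWith :: [Word8] -> [Char]
  }

-- | Convert a Haskell string (e.g. a filename) to O/S form.
toOs :: Codec -> [Char] -> OsData
toOs codec = OsData . encodeWith codec

-- | Convert O/S data (e.g. a program argument) to a Haskell string.
fromOs :: Codec -> OsData -> [Char]
fromOs codec (OsData octets) = decodeWith codec octets

-- | Example codec; code points above 0xFF are truncated on encoding.
latin1Codec :: Codec
latin1Codec = Codec
  { encodeWith = map (fromIntegral . ord)
  , decodeWith = map (chr . fromIntegral)
  }
```

Keeping the octets intact inside `OsData` means a program can pass a filename received from the operating system back to it unchanged, even when no codec can decode it faithfully.
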