Changes

malcolm.wallace@cs.york.ac.uk · 7892a2fe
--- a/unicode.md
+++ b/unicode.md
@@ -85,3 +85,70 @@ Assuming we retain Unicode as the representation of `Char`:
 - [BinaryIO](binary-io) is needed anyway, and would provide a base for these encodings.
 - A simple character-based I/O and system interface like that in Haskell 98, possibly taking defaults from the locale, will also be convenient for many users.
 - An abstract type may be needed for data in O/S form, such as filenames, program arguments and the environment.
+
+## A Straw-Man Proposal
+
+
+- **Internal character representation.**
+  The Haskell type `Char` is UCS-4.
+- **Haskell source encoding.**
+
+  - Introduce a pragma `{-# ENCODING e #-}` with a range of possible
+    values of the encoding `e`.  If the pragma is present, it must be
+    at the beginning of the file.  If it is not present, the file is
+    encoded in Latin-1.  Note that even when the pragma is present, some
+    heuristic may be needed just to get as far as reading the
+    encoding declaration, as in
+    [XML](http://www.w3.org/TR/2004/REC-xml11-20040204/#sec-guessing).
+    The fact that the first three characters must be `{-#` will be
+    useful here.
+  - A string literal may contain any character representable in the
+    source encoding.  In addition, escapes are provided so that
+    *any* Unicode character can be specified, whether or not it is
+    representable in the source encoding.
+  - An identifier may contain any Unicode alphanumeric or symbol
+    characters from a defined range.  Thus, a source text may not be
+    representable in certain other encodings (especially in ASCII).
+- **I/O.**  
+  All raw I/O is in terms of octets, i.e. `Word8`.
+- **Conversions.**
+  Pure functions exist to convert octets to and from any particular encoding:
+
+  ```wiki
+     stringDecode :: Encoding -> [Word8] -> [Char]
+     stringEncode :: Encoding -> [Char] -> [Word8]
+  ```
+
+  The codecs must operate on strings, not individual characters, because some
+  encodings use variable-length sequences of octets.
+- **Efficiency.**
+  Semantically, character-based I/O is a simple composition of the raw 
+  I/O primitives with an encoding conversion function.  However, for
+  efficiency, an implementation might choose to provide certain encoded
+  I/O operations primitively.  If such primitives are exposed to the 
+  user, they should have standard names so that other implementations can
+  provide the same functionality in pure Haskell Prime.
+- **Locales.**
+  It may be possible to retain the traditional I/O signatures for
+  `hGetChar`, `hPutChar`, `readFile`, `writeFile`, etc., but only by introducing
+  a stateful notion of *current encoding* associated with each
+  individual handle.  The default encoding could be inherited from the
+  operating system environment, but it should also be possible to
+  change the encoding explicitly.
+
+  ```wiki
+     getIOEncoding :: Handle -> IO Encoding
+     setIOEncoding :: Encoding -> Handle -> IO ()
+     resetIOEncoding :: Handle -> IO ()  -- go back to default
+  ```
+- **Filenames, program arguments, environment.**
+
+  - Filenames are stored in Haskell as `[Char]`, but the operating
+    system should receive `[Word8]` for any I/O using filenames.
+    Some encoding conversion is therefore required.  Usually, this will
+    be platform-dependent, and so the actual encoding may be hidden
+    from the programmer as part of the default locale.
+  - Program arguments, and symbols from the environment, are supplied
+    by the operating system to the Haskell program as `[Word8]`.
+    The program is responsible for conversion to `[Char]`.  Again,
+    there may be a default encoding chosen based on the locale.
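
The conversion functions in the proposal can be sketched in plain Haskell. What follows is a minimal illustration, not a definitive implementation. The proposal leaves `Encoding` abstract, so the two constructors `Latin1` and `UTF8` are assumptions made here, as are the substitution characters used for unrepresentable or malformed input and the helper `readFileWith`. The UTF-8 decoder is deliberately simplified: it does not reject overlong forms or surrogate code points.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical: the proposal does not say what an Encoding looks like.
data Encoding = Latin1 | UTF8
  deriving (Eq, Show)

-- Encode a string as octets.  Characters that Latin-1 cannot represent
-- are replaced by '?' (an assumption; the proposal is silent on errors).
stringEncode :: Encoding -> [Char] -> [Word8]
stringEncode Latin1 = map latin1
  where
    latin1 c
      | ord c < 0x100 = fromIntegral (ord c)
      | otherwise     = 0x3F
stringEncode UTF8 = concatMap utf8
  where
    utf8 c
      | n < 0x80    = [byte n]
      | n < 0x800   = [0xC0 .|. byte (n `shiftR` 6), cont n]
      | n < 0x10000 = [0xE0 .|. byte (n `shiftR` 12), cont (n `shiftR` 6), cont n]
      | otherwise   = [ 0xF0 .|. byte (n `shiftR` 18), cont (n `shiftR` 12)
                      , cont (n `shiftR` 6), cont n ]
      where n = ord c
    byte :: Int -> Word8
    byte = fromIntegral
    -- Continuation octet carrying the low six bits of its argument.
    cont x = 0x80 .|. byte (x .&. 0x3F)

-- Decode octets to a string.  Malformed UTF-8 yields U+FFFD and decoding
-- resumes at the next octet (again an assumption, not part of the proposal).
stringDecode :: Encoding -> [Word8] -> [Char]
stringDecode Latin1 = map (chr . fromIntegral)
stringDecode UTF8   = go
  where
    go [] = []
    go (b : bs)
      | b < 0x80           = chr (fromIntegral b) : go bs
      | b .&. 0xE0 == 0xC0 = multi 1 (b .&. 0x1F) bs
      | b .&. 0xF0 == 0xE0 = multi 2 (b .&. 0x0F) bs
      | b .&. 0xF8 == 0xF0 = multi 3 (b .&. 0x07) bs
      | otherwise          = '\xFFFD' : go bs
    -- Consume k continuation octets following a lead octet.
    multi k lead bs =
      let (cs, rest) = splitAt k bs
          ok = length cs == k && all (\x -> x .&. 0xC0 == 0x80) cs
          n  = foldl (\acc x -> (acc `shiftL` 6) .|. fromIntegral (x .&. 0x3F))
                     (fromIntegral lead) cs :: Int
      in if ok && n <= 0x10FFFF then chr n : go rest else '\xFFFD' : go bs

-- Character-based input as a composition of a raw octet reader with a
-- decoder.  The raw reader is passed in because no octet-level primitive
-- is named in the proposal.
readFileWith :: Encoding -> (FilePath -> IO [Word8]) -> FilePath -> IO String
readFileWith enc rawRead path = fmap (stringDecode enc) (rawRead path)
```

Under these assumptions, the character U+00E9 encodes to the single octet `0xE9` in Latin-1 and to the two octets `0xC3 0xA9` in UTF-8, which shows why the codecs must work on whole strings rather than on individual characters.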