# Unicode in Haskell source

The Haskell 98 Report ([Lexical Structure](http://www.haskell.org/onlinereport/lexemes.html)) claims that Haskell source code uses the [Unicode](unicode) character set.

## Current support for Unicode in source files

Haskell source code is stored in text files using various character sets and encodings.

If Unicode were allowed, how would implementations know which encoding was used?

- Jhc allows unrestricted use of the Unicode character set in Haskell source, treating input as UTF-8. Several uses of Unicode characters in place of Haskell keywords are permitted:

  In addition there is experimental support for defining new operators and names using various Unicode characters.

- Hugs treats input as being in the encoding specified by the current locale, but permits Unicode only in comments and character and string literals.
- GHC now (as of early Jan 2006) interprets source files as UTF-8. In `-fglasgow-exts` mode, the special symbols above are interpreted as in Jhc. GHC knows the character classification of every Unicode character via the `Data.Char` library, and can therefore understand identifiers written using alphanumeric characters from any language (but see below for a note about caseless character sets).
- Others treat source code as ISO 8859-1 (Latin-1).

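
The practical cost of this disagreement can be seen by decoding the same bytes in two ways. The following is a minimal sketch using only `base`; `decodeLatin1`, `decodeUtf8Small` and `sampleBytes` are names invented for this illustration, and the UTF-8 decoder deliberately handles only one- and two-byte sequences without full validation.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- | Decode bytes as ISO 8859-1: each byte is the code point of the same value.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- | Decode bytes as UTF-8, handling only one- and two-byte sequences.
-- This is an illustration, not a complete or validating decoder.
decodeUtf8Small :: [Word8] -> Maybe String
decodeUtf8Small [] = Just []
decodeUtf8Small (b : rest)
  | b < 0x80 = (chr (fromIntegral b) :) <$> decodeUtf8Small rest
decodeUtf8Small (b1 : b2 : rest)
  | b1 .&. 0xE0 == 0xC0 && b2 .&. 0xC0 == 0x80 =
      let cp = ((fromIntegral b1 .&. 0x1F) `shiftL` 6) .|. (fromIntegral b2 .&. 0x3F) :: Int
      in (chr cp :) <$> decodeUtf8Small rest
decodeUtf8Small _ = Nothing

-- | The bytes a UTF-8 editor would write for the four letters c, a, f, U+00E9.
sampleBytes :: [Word8]
sampleBytes = [0x63, 0x61, 0x66, 0xC3, 0xA9]
```

Read as Latin-1, `sampleBytes` is five characters ending in U+00C3 and U+00A9; read as UTF-8, it is four characters ending in U+00E9. An implementation that assumes the wrong encoding therefore sees a different character sequence from the one the author typed.
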
## Problems with Unicode in Haskell 98

There are plenty of Unicode alphabetic characters that are neither upper, lower, nor title case, and hence are not allowed in identifiers. Some scripts have no notion of case at all. Since Haskell's syntax relies on case to distinguish constructors from variables, what should our position be with respect to caseless character sets?

The Report should at least be absolutely clear about which Unicode general categories (N, Ll, Lu, Sm, etc.) correspond to which lexical class in the syntax.
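
To make both points concrete, here is a small sketch built on `Data.Char.generalCategory`. The `LexClass` type and the table in `lexClass` are assumptions made up for this example, not the Report's definition; the point is only that such a table can be written down explicitly, and that caseless letters fall outside it.

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isAlpha, isLower, isUpper)

-- | Hypothetical lexical classes, named loosely after the Report's nonterminals.
data LexClass = Large | Small | Digit | Symbol | Unclassified
  deriving (Eq, Show)

-- | One possible mapping from general category to lexical class.
-- The table is an assumption for illustration only.
lexClass :: Char -> LexClass
lexClass c = case generalCategory c of
  UppercaseLetter -> Large   -- Lu
  TitlecaseLetter -> Large   -- Lt
  LowercaseLetter -> Small   -- Ll
  DecimalNumber   -> Digit   -- Nd
  MathSymbol      -> Symbol  -- Sm
  CurrencySymbol  -> Symbol  -- Sc
  ModifierSymbol  -> Symbol  -- Sk
  OtherSymbol     -> Symbol  -- So
  _               -> Unclassified

-- | True for letters that are neither upper, title, nor lower case.
caseless :: Char -> Bool
caseless c = isAlpha c && not (isUpper c) && not (isLower c)
```

For instance, `'\x4E2D'` (a CJK ideograph, general category Lo) satisfies `caseless` and is `Unclassified` under this table, so it could start neither a variable nor a constructor name.
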
## Some things we could do

- Allow Unicode, with detection for the two common encodings (UTF-8 and UTF-16). See [SourceEncodingDetection](source-encoding-detection) for a proposal.
- Revert to US-ASCII, Latin-1 or implementation-defined character sets.
- Allow Unicode with the encoding specified outside source files (e.g. by the current locale, as currently done by Hugs). This would make Haskell source containing non-ASCII characters non-portable.
- Allow Unicode, with a mechanism for specifying encoding in the source file, e.g.

If Unicode is allowed, should its use be restricted?

- Haskell 98 already has character escapes for arbitrary Unicode characters in character and string literals. Thus Unicode in these literals can always be transformed into a portable form.
- Haskell 98 permits upper, title and lower case alphabetic characters (but not other alphabetic characters) in identifiers, and symbol or punctuation characters in symbols. Thus a source text may not be representable in all encodings (especially ASCII).

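
The first point can be illustrated with the Prelude alone, since `show` on a `String` already emits numeric escapes for every character above U+007F. In this sketch `asciiLiteral` and `sample` are hypothetical names introduced for the example.

```haskell
-- | Render a string as a Haskell string literal that contains only ASCII.
-- The name is hypothetical; the work is done by the Prelude's 'show'.
asciiLiteral :: String -> String
asciiLiteral = show

-- | A string containing U+03BB (Greek small lambda), itself written with
-- an escape so that this example stays ASCII-only.
sample :: String
sample = "\x3BB = 1"
```

Reading the result back with `read` recovers the original string, so the transformation loses nothing. No comparable escape exists for identifiers and operator symbols, which is why the second point remains a portability problem.
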
It is not reasonable to display all Unicode characters with the same width, but the Haskell 98 Report ([Layout](http://www.haskell.org/onlinereport/syntax-iso.html#layout)) says:

> For the purposes of the layout rule, Unicode characters in a source program are considered to be of the same, fixed, width as an ASCII character. However, to avoid visual confusion, programmers should avoid writing programs in which the meaning of implicit layout depends on the width of non-space characters.

Is this adequate?
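
To see what the quoted rule means in practice, the sketch below computes the column that the layout algorithm would assign to whatever follows a given line prefix. It assumes the Report's other conventions (first column is 1, tab stops 8 apart); `columnAfter` is a name made up for this illustration.

```haskell
-- | Column of the character that follows the given line prefix.
-- Columns start at 1, tab stops are 8 apart, and every other character
-- advances by exactly one column, whatever its display width.
columnAfter :: String -> Int
columnAfter = foldl step 1
  where
    step col '\t' = ((col - 1) `div` 8 + 1) * 8 + 1  -- jump to the next tab stop
    step col _    = col + 1
```

Under this rule `columnAfter "let x = "` and `columnAfter "let \xFF41 = "` are equal, although U+FF41 (a fullwidth Latin letter) is typically drawn two cells wide, so text that the compiler treats as aligned may not look aligned on screen.
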