Support Unicode characters in instance Show String
This feature request does not ask to change the language, but change the behavior of
Unicode has been widely adopted, and people around the world rely on it to write in their native languages. Haskell, however, has been stuck in ASCII: String's instance of the Show class escapes all non-ASCII characters, despite the fact that each element of a String is typically a Unicode code point and putStrLn actually works as expected. Consider the following examples:
    ghci> print "Hello, 世界"
    "Hello, \19990\30028"
    ghci> print "Hello, мир"
    "Hello, \1084\1080\1088"
    ghci> print "Hello, κόσμος"
    "Hello, \954\972\963\956\959\962"
    ghci> "Hello, 世界" -- ghci calls `show`, so string literals are also escaped
    "Hello, \19990\30028"
    ghci> "😀" -- Not only human scripts, but also emojis!
    "\128512"
This status quo is unsatisfactory for a number of reasons:
- Although it is a small thing, it creates an unwelcoming atmosphere for native speakers of languages whose scripts are not representable in ASCII.
- It is a real annoyance when debugging localized software or strings containing emojis.
- Following from the first point, Haskell teachers are forced either to use a language other than the students' mother tongue or to rely on I/O functions like putStrLn, creating an unnecessary burden.
- Other string types, like Text, rely on this instance.
read can already handle Unicode strings today, so relaxing constraints on show doesn't affect read . show == id.
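The claim about read can be checked with a small sketch like the one below. The names readsUnicode, readsEscaped, and roundTrips are hypothetical and exist only for this illustration; the file is assumed to be saved as UTF-8.

```haskell
-- A literal containing raw (unescaped) Unicode characters is accepted by read.
readsUnicode :: Bool
readsUnicode = (read "\"Hello, 世界\"" :: String) == "Hello, 世界"

-- The escaped spelling that show produces today parses to the same String.
readsEscaped :: Bool
readsEscaped = (read "\"Hello, \\19990\\30028\"" :: String) == "Hello, 世界"

-- The round-trip property referred to in the text, for a single input.
roundTrips :: String -> Bool
roundTrips s = read (show s) == s
```

Since read accepts both spellings, a show that emits the raw characters would still produce output that read can parse.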
By not escaping Unicode characters in the instance of String, all of the above problems go away immediately, and other string libraries benefit as well.
It is proposed here to change the Show instance of String to achieve the following output:
    ghci> print "Hello, 世界"
    "Hello, 世界"
    ghci> print "Hello, мир"
    "Hello, мир"
    ghci> print "Hello, κόσμος"
    "Hello, κόσμος"
    ghci> "Hello, 世界"
    "Hello, 世界"
    ghci> "😀"
    "😀"
Concretely, it means:
- Remove a few guards (in GHC.Show) so that Unicode characters are no longer escaped (except \n, etc.).
- Provide a function, or a newtype Escaped = Escaped String, that yields the current escaping behavior, in case anyone wants that escaping back.
String outputs and escape any unreadable characters to avoid printing gibberish to the user's terminal.
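As a rough illustration of the two concrete items listed above, the sketch below shows one way the behaviors could look. It is not the code in GHC.Show: showUnescaped is a hypothetical name, and the decision to keep escaping characters for which isPrint is False is an assumption of this sketch. Only the Escaped newtype name comes from the proposal; its instance body is likewise an assumption.

```haskell
import Data.Char (isAscii, isPrint, showLitChar)

-- Hypothetical stand-in for the proposed behaviour of show on String.
showUnescaped :: String -> String
showUnescaped s = '"' : go s
  where
    go []         = "\""
    go ('"' : cs) = '\\' : '"' : go cs
    go (c : cs)
      -- ASCII and non-printable characters keep today's escaping rules.
      | isAscii c || not (isPrint c) = showLitChar c (go cs)
      -- Printable non-ASCII characters are emitted unchanged.
      | otherwise                    = c : go cs

-- The wrapper named in the proposal. This instance reproduces what show
-- does for String today: every non-ASCII character is escaped.
newtype Escaped = Escaped String

instance Show Escaped where
  showsPrec _ (Escaped s) = showChar '"' . escapeAll s . showChar '"'
    where
      escapeAll []         = id
      escapeAll ('"' : cs) = showString "\\\"" . escapeAll cs
      escapeAll (c : cs)   = showLitChar c . escapeAll cs
```

With these definitions, showUnescaped keeps "世界" as written, while show (Escaped "世界") gives the same escaped text that show "世界" produces in current GHC.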
Because Show is usually only used in debugging and development, this change should not affect any production system.
putStrLn. This is viable today but unsatisfactory, because it requires stdout, and in some cases stdout is not really accessible, e.g. in Telegram or Discord eval bots.
Show instance escape unreadable strings. This increases robustness of the interaction between show and read, but it creates a hard dependency on Unicode, and thus affects portability.
ghci intercepts string outputs and checks whether they can be converted back to readable Unicode characters. This is another viable option. But the unescaping approach described above is preferable, because in today's GHC Haskell, show actually drops information from the strings, and the following wouldn't work as we'd like:
    ghci> Prelude.putStrLn $ show "世界"
    "\19990\30028" -- we'd like unescaped "世界" here
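For comparison, the ghci-interception alternative mentioned above could be approximated today by post-processing the output of show. The following is a minimal sketch under stated assumptions: unescapeShown and uprint are hypothetical names, and the decoder only understands the decimal escapes that show emits, so a custom Show instance that prints backslashes differently may confuse it.

```haskell
import Data.Char (chr, isDigit, isPrint)

-- Turn decimal escapes such as \19990 in show output back into the
-- characters they denote, when those characters are printable.
unescapeShown :: String -> String
unescapeShown [] = []
-- An escaped backslash is copied verbatim so that the digits after it
-- are not mistaken for a numeric escape.
unescapeShown ('\\' : '\\' : rest) = '\\' : '\\' : unescapeShown rest
unescapeShown ('\\' : rest)
  | (digits@(_ : _), rest') <- span isDigit rest
  , let n = read digits :: Integer
  , n > 127 && n <= 0x10FFFF          -- leave ASCII escapes alone
  , let c = chr (fromInteger n)
  , isPrint c
  = c : unescapeShown rest'
-- Everything else, including escapes like \n, passes through unchanged.
unescapeShown (c : rest) = c : unescapeShown rest

-- A print replacement built on the decoder above.
uprint :: Show a => a -> IO ()
uprint = putStrLn . unescapeShown . show
```

If this lives in a module that is loaded into the session, GHCi's -interactive-print option (for example :set -interactive-print=uprint) should make the prompt use it. This helps only where ghci controls the printing, which is the limitation the paragraph above points out.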