  • #20027
Issue created Jun 21, 2021 by ksqsf@ksqsf

Support Unicode characters in instance Show String

This feature request does not ask for a change to the language, only for a change to the behavior of base.

Motivation

Unicode has been widely adopted, and people around the world rely on it to write in their native languages. The Show instance for String, however, is still stuck in ASCII: it escapes all non-ASCII characters, even though each element of a String is a Unicode code point and putStrLn prints such strings as expected. Consider the following examples:

ghci> print "Hello, 世界"
"Hello, \19990\30028"

ghci> print "Hello, мир"
"Hello, \1084\1080\1088"

ghci> print "Hello, κόσμος"
"Hello, \954\972\963\956\959\962"

ghci> "Hello, 世界"       -- ghci calls `show`, so string literals are also escaped
"Hello, \19990\30028"

ghci> "😀"  -- Not only human scripts, but also emojis!
"\128512"

This status quo is unsatisfactory for a number of reasons:

  1. Although it is a small thing, it creates an unwelcoming atmosphere for native speakers of languages whose scripts are not representable in ASCII.
  2. It is a real annoyance when debugging localized software or strings containing emoji.
  3. Following from 1, Haskell teachers are forced either to use another language instead of the students' mother tongue or to rely on I/O functions like putStrLn, which creates an unnecessary burden.
  4. Other string types, like Text, rely on this Show instance.

Moreover, read already accepts unescaped Unicode characters in string literals today, so relaxing the escaping done by show does not break read . show == id.
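The round-trip claim can be checked with a small sketch. The names roundTrips and readsUnescaped are made up for illustration and are not part of base:

```haskell
-- The property `read . show == id`, restricted to String.
roundTrips :: String -> Bool
roundTrips s = read (show s) == s

-- `read` accepts a string literal that contains unescaped non-ASCII
-- characters, which is the form the proposed `show` would produce.
readsUnescaped :: Bool
readsUnescaped = (read "\"Hello, 世界\"" :: String) == "Hello, 世界"
```

Both checks only use read and show from the Prelude, so they can be tried directly in ghci.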

By not escaping Unicode characters in the instance of Show for String, all of the above problems go away immediately, and other string libraries benefit as well.

Proposal

The proposal is to change the Show instance of String so that it produces the following output:

ghci> print "Hello, 世界"
"Hello, 世界"

ghci> print "Hello, мир"
"Hello, мир"

ghci> print "Hello, κόσμος"
"Hello, κόσμος"

ghci> "Hello, 世界"      
"Hello, 世界"

ghci> "😀" 
"😀"

Concretely, it means:

  1. Remove a few guards in showLitChar (in GHC.Show) so that Unicode characters are no longer escaped (escapes such as \n are kept).
  2. Provide a function showEscaped, or newtype Escaped = Escaped String, to obtain the current escaping behavior, in case anyone wants that escaping back.
  3. ghci should intercept String outputs and escape any unreadable characters to avoid printing gibberish to the user's terminal.
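To make step 1 more concrete, here is a minimal sketch of the intended behavior, written as a standalone function rather than as a patch to GHC.Show. The names showsUnicodeChar, showsUnicodeString and showUnicodeString are hypothetical, and using isPrint to decide what counts as readable is an assumption of this sketch, not something the proposal fixes:

```haskell
import Data.Char (isAscii, isPrint, showLitChar)

-- Show one character of a string literal. Printable non-ASCII characters
-- are emitted as-is; everything else falls back to the existing escaping
-- in showLitChar. (Hypothetical helper, not part of base.)
showsUnicodeChar :: Char -> ShowS
showsUnicodeChar c
  | c == '"'                     = showString "\\\""  -- showLitChar leaves '"' alone
  | not (isAscii c) && isPrint c = showChar c
  | otherwise                    = showLitChar c

-- Show a whole string. Each character is applied to the rest of the
-- output, so showLitChar can still insert "\&" where an escape would
-- otherwise run into the following character.
showsUnicodeString :: String -> ShowS
showsUnicodeString s =
  showChar '"' . foldr (\c rest -> showsUnicodeChar c . rest) id s . showChar '"'

showUnicodeString :: String -> String
showUnicodeString s = showsUnicodeString s ""
```

With this, putStrLn (showUnicodeString "Hello, 世界") would print the literal with the characters intact, while control characters such as \n stay escaped.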

Since Show is usually only used in debugging and in development, this change should not affect any production system.

Alternative Options

  1. Always use putStrLn. This is viable today but unsatisfactory, because it requires stdout, which is not always accessible, e.g. in Telegram or Discord eval bots.
  2. Make the Show instance escape only unreadable characters. This increases the robustness of the interaction between show and read, but it creates a hard dependency on Unicode, and thus affects portability.
  3. Customize ghci instead. ghci intercepts string outputs and checks whether the escapes can be converted back to readable Unicode characters. This is another viable option. But the unescaping approach described above is preferable, because in today's GHC Haskell show itself replaces the original characters with numeric escapes, so the following does not work as we'd like:
ghci> Prelude.putStrLn $ show "世界"
"\19990\30028"  -- we'd like unescaped "世界" here
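For comparison, alternative 3 can be approximated today with a custom printing function. The sketch below post-processes the output of show and turns decimal escapes back into characters when they are printable. The names unescapeShown and uprint are hypothetical, and the code assumes its input was produced by show:

```haskell
import Data.Char (chr, isDigit, isPrint)

-- Replace decimal escapes such as \19990 with the character they denote,
-- provided the character is printable. Other escapes (\n, \\, \", \& ...)
-- are copied through in pairs, so an escaped backslash followed by digits
-- is not misread as a numeric escape.
unescapeShown :: String -> String
unescapeShown ('\\' : c : rest)
  | isDigit c, n <= 0x10FFFF, isPrint ch = ch : unescapeShown (dropGap rest')
  | isDigit c                            = '\\' : digits ++ unescapeShown rest'
  | otherwise                            = '\\' : c : unescapeShown rest
  where
    (digits, rest') = span isDigit (c : rest)
    n  = read digits :: Integer
    ch = chr (fromInteger n)
    -- After a replaced escape the "\&" separator is no longer needed.
    dropGap ('\\' : '&' : xs) = xs
    dropGap xs                = xs
unescapeShown (c : rest) = c : unescapeShown rest
unescapeShown []         = []

-- A candidate printing function for ghci.
uprint :: Show a => a -> IO ()
uprint = putStrLn . unescapeShown . show
```

If these definitions were placed in a module, say a hypothetical UPrint, ghci could be started along the lines of ghci -interactive-print=UPrint.uprint UPrint. This only changes what ghci displays, so show "世界" would still return the escaped form, which is the limitation described in option 3.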
Edited Jun 21, 2021 by ksqsf