
Unicode support in the CLI is very weird on Windows

Summary

On Windows, getLine fails to read Unicode characters even when both the codepage and the stdin encoding are set to UTF-8, and putStrLn fails to write Unicode characters when the codepage is not set, even though it could write them.

Steps to reproduce

Type the following in a GHCi session:

getLine

and then enter a character like "µ" or "ẞ". If you enter a single such character, the result is:

"\NUL"

Alternatively, write and compile a .hs file with this content:

main :: IO ()
main = getLine >>= putStrLn

If you run this without first running chcp 65001 or enabling the "UTF-8 by default" setting, entering Unicode characters results in an error (note: I assume this behaviour persists in compiled code, but so far I have only reproduced it in GHCi):

*** Exception: <stdout>: hPutChar: invalid argument (invalid character)

or prints a ? character instead (note: I do not assume this is the default behaviour, because I had set my codepage to an unfamiliar one for testing purposes).

When the codepage is set, entering unsupported characters leads to them being replaced with a space:

HelloµHello
Hello Hello

Adding hSetEncoding stdin utf8 to any of the above examples does not change the behaviour.

Running hGetEncoding stdin in GHCi when the codepage is 65001 gives:

Just UTF-8

but the incorrect behaviour occurs nonetheless.
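The two checks above (setting the stdin encoding explicitly and querying it) can be combined into one small program for reproduction. This is only a sketch for diagnosis and not a fix. Printing the stdout encoding as well is an addition that is not part of the original report.

```haskell
import System.IO (hGetEncoding, hSetEncoding, stdin, stdout, utf8)

main :: IO ()
main = do
  -- Explicitly request UTF-8 decoding on stdin, as described in the report.
  hSetEncoding stdin utf8
  -- Show which encodings GHC reports for the handles
  -- (the stdout line is an extra diagnostic, not from the report).
  hGetEncoding stdin  >>= print
  hGetEncoding stdout >>= print
  -- Echo one line. On an affected Windows console, a non-ASCII character
  -- is reported to come back altered or to raise an exception.
  getLine >>= putStrLn
```

Running it in a console with and without chcp 65001 should make it easier to tell which of the described outcomes occurs.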

Expected behavior

For the getLine example:

Entering µ should return

"\181"
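For reference, the expected value follows from how show escapes non-ASCII characters: µ is U+00B5, which is 181 in decimal. The names microCodePoint and expectedDisplay below are purely illustrative.

```haskell
import Data.Char (ord)

-- Code point of the micro sign (U+00B5), written as an escape so the
-- example does not depend on the source file's encoding.
microCodePoint :: Int
microCodePoint = ord '\181'

-- What GHCi should display for a one-character line containing the micro
-- sign: show renders non-ASCII characters as decimal escapes.
expectedDisplay :: String
expectedDisplay = show "\181"

main :: IO ()
main = do
  print microCodePoint      -- prints 181
  putStrLn expectedDisplay  -- prints "\181"
```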

For the compiled code example:

The input string should be printed back unchanged.

Environment

  • GHC version used: 8.10.1

Optional:

  • Operating System: Windows 10
  • System Architecture: 64-Bit

Important

This is only the case on Windows.

This works fine on Linux or on WSL.

Edited by Anselm Schüler