Unicode in CLI support is very weird on Windows
Summary
On Windows, getLine fails to read Unicode characters even when the codepage and the stdin encoding are both set to UTF-8, and putStrLn fails to write Unicode characters when the codepage is not set, even though it could.
Steps to reproduce
Type the following in a GHCi session:
getLine
and then enter a character like "µ" or "ẞ". If you have entered a single character, the result will be this:
"\NUL"
Alternatively, write and compile a .hs file with this content:
main :: IO ()
main = getLine >>= putStrLn
When running this without first running chcp 65001 or enabling the "UTF-8 by default" setting, entering Unicode characters will result in an error (note: I assume this behaviour persists in compiled code, but so far I have only reproduced it in GHCi):
*** Exception: <stdout>: hPutChar: invalid argument (invalid character)
or print a ? character instead (note: I do not assume this is the default behaviour, since I had set my codepage to an unfamiliar one for testing purposes).
When the codepage is set, entering unsupported characters leads to them being replaced with a space:
HelloµHello
Hello Hello
Adding hSetEncoding stdin utf8 to any of the above examples does not change the behaviour.
Running hGetEncoding stdin in GHCi when the codepage is 65001 gives:
Just UTF-8
but the incorrect behaviour occurs nonetheless.
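To narrow down where the character gets lost, a small diagnostic along the following lines may help. This is only a sketch: codePoints is a helper name made up for this illustration, and the program assumes nothing beyond the System.IO functions already mentioned above. It prints the encoding reported for stdin and then the numeric code points of the line it read, so the output itself is plain ASCII and should not depend on the stdout encoding.

```haskell
import System.IO (hGetEncoding, hSetEncoding, stdin, utf8)

-- Hypothetical helper: the numeric code point of every character in a string.
codePoints :: String -> [Int]
codePoints = map fromEnum

main :: IO ()
main = do
  -- Force UTF-8 on stdin, as in the variant described above.
  hSetEncoding stdin utf8
  -- Report what the runtime believes the stdin encoding is.
  enc <- hGetEncoding stdin
  putStrLn ("stdin encoding: " ++ show enc)
  -- Read one line and show it as code points instead of echoing it.
  line <- getLine
  print (codePoints line)
```

If input is decoded correctly, entering µ should give [181]; the "\NUL" result seen in GHCi corresponds to [0].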
Expected behavior
For the getLine example:
Entering µ should return "\181"
For the compiled code example:
The string should be echoed back unchanged.
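The two expectations differ because GHCi displays the result of getLine with show, which escapes non-ASCII characters as decimal code points, while putStrLn writes the characters themselves. A minimal sketch of that difference follows; rendered is a name introduced here only for illustration.

```haskell
import Data.Char (ord)

-- What GHCi would display for a one-character string holding µ (U+00B5).
rendered :: String
rendered = show "\181"

main :: IO ()
main = do
  -- The code point of µ in decimal.
  print (ord '\181')
  -- The escaped form, including the surrounding quotes.
  putStrLn rendered
```

So "\181" is simply the escaped form of a correctly read µ, not a sign of a further encoding problem.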
Environment
- GHC version used: 8.10.1
Optional:
- Operating System: Windows 10
- System Architecture: 64-Bit
Important
This is only the case on Windows.
This works fine on Linux or on WSL.