Support Unicode characters in instance Show String

/cc @core-libraries

I was not even aware that the Show instance for String behaved as it currently does. I am in favor of this change. I do not understand what Alternative 2 means:

> Make the Show instance escape unreadable strings. This increases robustness of the interaction between show and read, but it creates a hard dependency on Unicode, and thus affects portability.

GHC already has a "hard dependency on Unicode". The value backing Char is a Unicode code point.

> GHC already has a "hard dependency on Unicode". The value backing Char is a Unicode code point.

Oh, I was under the wrong impression! This sounds like good news: we can have Python-style repr in base.

I am strongly in favour of this change.

The key question is which characters to escape, since at least backslash and double-quotes need escaping, and also ASCII controls (escape sequences should not mess up the screen, ...).
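One possible escaping policy can be sketched as follows. This is only an illustration, not the proposed implementation: the name `showUnicodeString` is hypothetical, and using `isPrint` as the test for "safe to print as-is" is an assumption. Backslash, double quote, and anything `isPrint` rejects (ASCII controls, surrogates, unassigned code points) are escaped; everything else passes through unchanged.

```haskell
import Data.Char (isPrint, showLitChar)

-- Hypothetical helper, not part of base: like 'show' for String,
-- but printable non-ASCII characters are emitted unescaped.
showUnicodeString :: String -> String
showUnicodeString s = '"' : go s
  where
    go []         = "\""
    -- showLitChar does not escape a double quote, so handle it here.
    go ('"' : cs) = '\\' : '"' : go cs
    go (c : cs)
      -- Assumption: isPrint is the criterion for "safe to print as-is".
      | c /= '\\' && isPrint c = c : go cs
      -- Everything else falls back to the existing escaping logic.
      -- Passing the rest of the output as the continuation lets
      -- showLitChar insert "\&" where an escape would otherwise be
      -- ambiguous (for example, a numeric escape followed by a digit).
      | otherwise              = showLitChar c (go cs)
```

With this sketch, `showUnicodeString "Нечетное!"` yields the same text wrapped in double quotes, while a newline still comes out as `\n`.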

One could then run into complications when (despite all the other code pages being obsolete) the user's terminal is not UTF-8, or the "string" contains invalid Unicode code points. With print, et al., this is presumably handled by correctly setting the encoding property of stdout.

Complications can also arise on Unix if the tty flag bits are set to enable parity or cs7, ... I don't know whether GHCi puts the terminal into a sane 8-bit clean mode. While we're here: if GHCi has spawned 'vi' via ":edit", typing 'Ctrl-Z' to suspend it (say, to run a shell command) and then resuming leaves "vi" in the wrong mode, because "ghci" fails to save/restore the terminal mode on suspend/resume. That is, I guess, worth a separate issue...

When GHCi is doing the display implicitly, I don't know what controls the encoding, but that too should be solvable.

In the final analysis I am supportive of the proposal (IIRC there are older issues suggesting much the same change), but there are some subtleties to take care of.

I like the idea of valid, non-ASCII unicode strings showing without escaping.

I think we'd still want to escape invalid unicode. We probably wouldn't want to pass surrogate characters through, for example, without escaping them.
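As a small illustration of why surrogates are detectable: a Haskell `Char` can hold a surrogate code point, and `Data.Char` classifies it as such. The helper name `isSurrogate` below is hypothetical; `generalCategory` and `isPrint` are the existing `Data.Char` functions.

```haskell
import Data.Char (GeneralCategory (Surrogate), generalCategory, isPrint)

-- Hypothetical helper: True for code points U+D800..U+DFFF.
isSurrogate :: Char -> Bool
isSurrogate c = generalCategory c == Surrogate

-- isSurrogate '\xD800'  evaluates to True
-- isPrint     '\xD800'  evaluates to False
-- isSurrogate 'к'       evaluates to False
```

Because `isPrint` already rejects surrogates, an escaping rule built on `isPrint` would keep escaping them without a separate check.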

(Side-note: I would not be against a hard dependency on UTF-8 terminals for GHCi. In my experience, systems that are utterly incapable of supporting UTF-8 locales like C.UTF-8 are already de facto excluded from running GHC-based programs for a myriad of other reasons.)

> I don't know whether GHCi puts the terminal into a sane 8-bit clean mode.

Maybe we can make sure of that!

> I am supportive of the proposal (IIRC there are older issues suggesting much the same change), but there are some subtleties to take care of.

So am I, and I fully agree that we should dedicate the appropriate amount of resources to nail those.

So long as the C.UTF-8 (I use LANG=C LC_CTYPE=en_US.UTF-8) requirement is explicitly stated and acceptable to the community, I am all for it, but only because I don't personally experience this as a limitation. Perhaps there are in fact users of Haskell programs who don't have access to a UTF-8 locale? I simply don't know how likely that is, and what work-arounds are available to them...

For reference, here is what Python seems to have been doing since 2016: https://www.python.org/dev/peps/pep-0538/#abstract

Disclaimer: I am not a Windows user, but have remotely helped Haskell beginners on Windows in the past.

From the difficulties experienced by said beginners, I understand that the Windows terminal uses a legacy codepage (437) by default (presumably for backwards compatibility). This means that output not representable in CP437 would result in IO errors (as the stdout handle’s encoding is correctly initialized to CP437) unless the terminal codepage is manually changed to UTF-8 (with chcp 65001) beforehand.

I expect that the difference in functionality is too minor (mostly GHCi displaying string results), and I don’t even know what sort of environment the current recommended guides for GHC on Windows set up; I just thought it might be worth mentioning. If the suggested requirement for a proper UTF-8-capable environment is instituted, any such guides should make sure to include instructions to meet it.

Edit: The error that appears/appeared is <stdout>: commitBuffer: invalid argument (invalid character). Given how unfriendly this is, and how likely it is to arise in a non-UTF-8 environment (the localeEncoding is used by default for files as well as stdout), ensuring a proper UTF-8-capable environment is probably the better play as long as a beginner can be reasonably instructed to do so.
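For programs that cannot rely on the locale, one workaround is to set the handle encodings explicitly at startup. The sketch below uses the existing `System.IO` functions `hSetEncoding` and `hGetEncoding`; the names `forceUtf8Output` and `describeStdoutEncoding` are hypothetical. It only changes which bytes the program emits: the terminal must still interpret them as UTF-8 (for example after `chcp 65001` on Windows).

```haskell
import System.IO (hGetEncoding, hSetEncoding, stderr, stdout, utf8)

-- Hypothetical helper: make stdout and stderr encode text as UTF-8,
-- regardless of what localeEncoding was picked up at startup.
forceUtf8Output :: IO ()
forceUtf8Output = do
  hSetEncoding stdout utf8
  hSetEncoding stderr utf8

-- Hypothetical helper: report the encoding currently set on stdout.
-- hGetEncoding returns Nothing for a handle in binary mode.
describeStdoutEncoding :: IO String
describeStdoutEncoding = maybe "binary" show <$> hGetEncoding stdout
```

Calling `forceUtf8Output` first thing in `main` avoids the `invalid argument (invalid character)` error for output, at the cost of possibly showing mojibake on a terminal that is not in UTF-8 mode.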

The story may (I hope) be improved with the new Windows IO manager in 9.x; @Phyx would probably know what obstacles (if any) remain at this point.

WinIO should have no issues with Unicode at all. Unicode support was one of the reasons for the new I/O manager. Do note however that it's not on by default yet.

Can you comment on the likely timing of the new IO manager becoming the default? Do you have any thoughts about the wisdom or otherwise of displaying unicode text as-is? (What is the correct notion of printable? Are concerns around "confusable" code points relevant here? ...)

There are two things here:

Making winio the default for GHC: I think we're mostly there. I just need to find time to fix the last remaining bug.
Making winio the default for end users: this requires work on Network. I was originally planning on going for a highly optimized implementation to begin with, but given the time required I will simply go for a working one for now. There is no ETA here unless I get some help.

> Do you have any thoughts about the wisdom or otherwise of displaying unicode text as-is?

GHC/Base is Unicode internally already, but the representation is not efficient for all platforms. If Unicode were to become the default for input/output, I suggest (aside from needing winio) that the internal buffers on Windows be changed to UTF-16.

Otherwise Windows users will pay a double encoding penalty.

> (What is the correct notion of printable? Are concerns around "confusable" code points relevant here? ...)

What do you mean by confusable codepoints?

Note that the topic here is not the internal representation of Unicode data, but rather which characters are escaped by show @String, or equivalently by the underlying showLitChar function. How the input or the result is stored is at a layer below this.

There is of course an effort underway to move Text to UTF-8, which could I guess incur some sort of penalty on Windows, but with the world increasingly moving to UTF-8 this seems generally applicable beyond GHC. So I think that ship has sailed...

As to "confusable" codepoints, this is about unicode glyphs that look the same to human readers, even though the underlying codepoints are different. Despite appearances, the three strings below have no characters in common:

cosmos
космос
κοσμοσ

As can be seen via:

λ> "cosmos"
"cosmos"
λ> "космос"
"\1082\1086\1089\1084\1086\1089"
λ> "κοσμοσ"
"\954\959\963\956\959\963"
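The same observation can be made without relying on how GHCi shows strings. The following sketch (the names `codePoints` and `sharedChars` are made up for illustration) lists the code points of each string and computes the characters any two of them share:

```haskell
import Data.Char (ord)
import Data.List (intersect)

latin, cyrillic, greek :: String
latin    = "cosmos"   -- Latin letters
cyrillic = "космос"   -- Cyrillic letters
greek    = "κοσμοσ"   -- Greek letters

-- The numeric code point of every character in a string.
codePoints :: String -> [Int]
codePoints = map ord

-- Characters of the first string that also occur in the second.
sharedChars :: String -> String -> String
sharedChars = intersect

-- codePoints cyrillic  evaluates to [1082,1086,1089,1084,1086,1089]
-- codePoints greek     evaluates to [954,959,963,956,959,963]
-- sharedChars latin cyrillic, sharedChars latin greek and
-- sharedChars cyrillic greek are all empty.
```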

Thank you very much for this clarification! We might indeed want to unify the environments in which GHCi runs. Maybe automatically handling the chcp would be an option in order to lower the barrier to entry for new (Windows) users.

WinIO already handles the console codepage transparently.

Unless the user has explicitly changed the codepage to something else, a Haskell program running with winio will default to a Unicode codepage.

Yes, please. I had suppressed this traumatic experience from my youth, but nothing is more disheartening for a learner than

> foo x = if odd x then "Нечетное!" else "Четное!"
> foo 5
"\1053\1077\1095\1077\1090\1085\1086\1077!"

This seems like a breaking change, potentially breaking a lot of code. Is that really worth it? Did anyone investigate the potential fallout?
I currently cannot check, because all mobile repls are broken, but will this not require changing read as well?
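On the question about read, a quick check can be sketched like this. The names `unescaped` and `parsed` are hypothetical; the check assumes the existing `Read String` instance, which accepts any character other than a backslash or double quote literally inside a string literal.

```haskell
-- A string literal as the proposed Show instance might print it,
-- with the Cyrillic letters left unescaped.
unescaped :: String
unescaped = "\"Нечетное!\""

-- Parse it back with the existing Read instance for String.
parsed :: String
parsed = read unescaped

-- parsed == "Нечетное!" evaluates to True, and the current
-- show parsed still produces the escaped form shown above.
```

If that assumption holds, the parsing side needs no change; the fallout would be limited to code that depends on the exact escaped output of show.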