GHC issues (https://gitlab.haskell.org/ghc/ghc/-/issues)

# Issue #24549: '<stdin>: hGetLine: invalid argument' with Unicode input on Windows
https://gitlab.haskell.org/ghc/ghc/-/issues/24549
Reported by Siyuan Chen, 2024-03-19

## Summary
On Windows, handling Unicode input reports an error.
Due to the probabilistic nature of this issue and differences between locales, I have narrowed it down as much as possible (see below). Nevertheless, I'm not sure whether it is 100% reproducible on your side.
## Steps to reproduce
Change "Current language for non-Unicode programs:" to "Chinese (Simplified, China)" and don't check "Beta: UTF-8". (This might not be necessary, but it helps narrow down the problem.)
Create a new project (named testunicode).
```
-- Main.hs
module Main where
import System.IO
main :: IO ()
main = do
  i_enc <- hGetEncoding stdin
  o_enc <- hGetEncoding stdout
  e_enc <- hGetEncoding stderr
  putStrLn $ show i_enc
  putStrLn $ show o_enc
  putStrLn $ show e_enc
  line <- getLine
  putStrLn line
```
Ensure there is only one conhost.exe in your task manager. (This might not be necessary, but it helps narrow down the problem.)
Open `cmd.exe` and set `PATH` for GHC and cabal. (DON'T launch it via a `.cmd` script; that may make the issue unreproducible!)
Ensure the console font is "新宋体". (Note that if you select another font, there is no error, but a different problem occurs; see below.)
```
C:\work-pl\haskell\testunicode>chcp
Active code page: 936
C:\work-pl\haskell\testunicode>chcp 65001
Active code page: 65001 (This will clear the console)
C:\work-pl\haskell\testunicode>cabal build
... [2 of 2] Linking ... \\testunicode.exe
C:\work-pl\haskell\testunicode>cabal run
Just UTF-8
Just UTF-8
Just UTF-8
Фывфыв
testunicode: <stdin>: hGetLine: invalid argument (cannot decode byte sequence starting from 208)
```
If you cannot reproduce this (possibly because your code page already defaults to UTF-8), do the following:
```
C:\work-pl\haskell\testunicode>chcp 936
Active code page: 936
C:\work-pl\haskell\testunicode>chcp 65001
Active code page: 65001
C:\work-pl\haskell\testunicode>cabal run
Just UTF-8
Just UTF-8
Just UTF-8
Фывфыв
testunicode: <stdin>: hGetLine: invalid argument (cannot decode byte sequence starting from 208)
```
This should reproduce the error (DON'T close the console window; see below).
It seems like a Windows bug, because sometimes there is no problem; for example, if you launch the console via a `.cmd` script and the console's active code page defaults to UTF-8, it works. However, I don't believe it is a Windows bug, because some other languages are fine, e.g. Racket.
It even works perfectly in `--io-manager=native` mode.
For example, edit testunicode.cabal and add `ghc-options: -rtsopts`.
```
C:\work-pl\haskell\testunicode>cabal clean
C:\work-pl\haskell\testunicode>cabal build
... [2 of 2] Linking ... \\testunicode.exe
C:\work-pl\haskell\testunicode>cabal run testunicode -- +RTS --io-manager=native
Just UTF-8
Just UTF-8
Just UTF-8
Фывфыв
Фывфыв
```
It works perfectly.
However, `--io-manager=native` does not work normally in the REPL.
```
C:\work-pl\haskell\testunicode>cabal repl testunicode --repl-options="+RTS --io-manager=native"
ghci> main
Just UTF-8
Just UTF-8
Just UTF-8
<----- STUCK HERE!!!
```
Note that this issue affects only input; there is no problem with output. For example, if Main.hs has no `getLine` and just `putStrLn "Фывфыв"`, it works perfectly.
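For what it's worth, the failure can at least be observed programmatically. The following is my own illustration, not part of the reproducer: a sketch that catches the decode error from `hGetLine` instead of letting the program die (`safeGetLine` is a hypothetical helper name).

```haskell
import Control.Exception (IOException, try)
import System.IO

-- Read one line from a handle, reporting decode failures instead of dying.
safeGetLine :: Handle -> IO (Either IOException String)
safeGetLine = try . hGetLine

main :: IO ()
main = do
  r <- safeGetLine stdin
  case r of
    Left e     -> hPutStrLn stderr ("decode failure: " ++ show e)
    Right line -> putStrLn line
```

On an affected console this prints the "cannot decode byte sequence" message via the `Left` branch instead of aborting with an uncaught exception.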
##### If you select the font "Lucida Console" instead of "新宋体"
Then no error occurs, but nothing is shown.
```
C:\work-pl\haskell\testunicode>cabal run
Just UTF-8
Just UTF-8
Just UTF-8
Фывфыв
<----- NOTHING!
```
However, Lucida Console can definitely render "Фывфыв"; see https://www.myfonts.com/collections/lucida-console-font-monotype-imaging
Thanks.
Related issues: https://gitlab.haskell.org/ghc/ghc/-/issues/10542 https://gitlab.haskell.org/ghc/ghc/-/issues/18307
## Expected behavior
Handle Unicode input correctly.
## Environment
* GHC version used: 9.8.1
* Milestone: 9.8.3

# Issue #22183: Update to Unicode 15.0.0
https://gitlab.haskell.org/ghc/ghc/-/issues/22183
Reported by Pierre Le Marre

[Unicode 15.0.0](https://www.unicode.org/versions/Unicode15.0.0/) has been published on 13th September 2022.

# Issue #21228: GHC Lexer reports error in string literal including ZWJ
https://gitlab.haskell.org/ghc/ghc/-/issues/21228
Reported by Nobuo Yamashita

## Summary
GHC Lexer reports error in string literal including ZWJ.
## Steps to reproduce
1. Provide the following file.
```haskell
-- foo.hs
main = do
  putStrLn "\128104\8205\128105\8205\128103\8205\128102"
  putStrLn "👨👩👧👦"
```
2. Try to compile it and get a lexical error at ZWJ.
```
$ ghc foo.hs
[1 of 1] Compiling Main ( foo.hs, foo.o )
foo.hs:3:16: error:
lexical error in string/character literal at character '\8205'
|
3 | putStrLn "👨👩👧👦"
| ^
```
## Expected behavior
We expect the same behavior for both
    putStrLn "\128104\8205\128105\8205\128103\8205\128102"
and
    putStrLn "👨👩👧👦"
without a lexical error.
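For reference, the escaped form above is accepted today and denotes a seven-character sequence: four emoji joined by three ZWJ (U+200D) characters. A small check (my own illustration):

```haskell
-- The family emoji written with escapes: man, ZWJ, woman, ZWJ, girl, ZWJ, boy.
family :: String
family = "\128104\8205\128105\8205\128103\8205\128102"

main :: IO ()
main = do
  print (length family)        -- 7
  print (map fromEnum family)  -- [128104,8205,128105,8205,128103,8205,128102]
```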
## Environment
* GHC version used: 8.10.7, 9.0.2 and 9.2.2
Optional:
* Operating System: Ubuntu 20.04
* System Architecture: x86_64

Assignee: Zubin

# Issue #20353: Incorrect encoding for "error" strings on Windows
https://gitlab.haskell.org/ghc/ghc/-/issues/20353
Reported by ShrykeWindgrace

## Summary
Cyrillic characters provided to `error` get mangled; cyrillic characters printed via `hPutStrLn stderr` are ok.
## Steps to reproduce
The following code
```haskell
module Main where
import System.IO
import GHC.IO.Encoding
main :: IO ()
main = do
  putStr "stdout encoding: " >> hGetEncoding stdout >>= print
  putStr "stderr encoding: " >> hGetEncoding stderr >>= print
  putStr "getForeignEncoding: " >> getForeignEncoding >>= print
  putStr "getLocaleEncoding: " >> getLocaleEncoding >>= print
  putStr "getFileSystemEncoding: " >> getFileSystemEncoding >>= print
  putStrLn "latin"
  putStrLn "кириллица 1"
  hPutStrLn stderr "кириллица 2"
  error "кириллица 3"
```
is built with `ghc-9.0.1` via `stack` and `resolver: nightly-2021-09-09`.
Running it with `stack exec -- nameOfexecutable` (optionally `+RTS --io-manager=native`) yields
```shell
stdout encoding: Just UTF-8
stderr encoding: Just UTF-8
getForeignEncoding: UTF-8
getLocaleEncoding: UTF-8
getFileSystemEncoding: UTF-8
latin
кириллица 1
кириллица 2
codepage-exe.EXE: кириллица 3
CallStack (from HasCallStack):
error, called at app\Main.hs:17:5 in main:Main
```
## Expected behavior
All cyrillic characters are displayed correctly.
## Environment
* GHC version used: 9.0.1
* Terminal: Windows Terminal
* Shell: powershell-7.1.4 with `$OutputEncoding`
```powershell
Preamble :
BodyName : utf-8
EncodingName : Unicode (UTF-8)
HeaderName : utf-8
WebName : utf-8
WindowsCodePage : 1200
IsBrowserDisplay : True
IsBrowserSave : True
IsMailNewsDisplay : True
IsMailNewsSave : True
IsSingleByte : False
EncoderFallback : System.Text.EncoderReplacementFallback
DecoderFallback : System.Text.DecoderReplacementFallback
IsReadOnly : True
CodePage : 65001
```
The git-bash shell exhibits the same behavior.
Optional:
* Operating System: win10
* System Architecture: x86_64
Judging by the form of the above mojibake, it is UTF-8-encoded Cyrillic text interpreted as cp1251. When I add `mkTextEncoding "CP1251" >>= setForeignEncoding` before `error`, everything is displayed correctly.
Is there a setting that would make the program detect the correct encoding (foreign locale?) for `error` output regardless of OS and/or Windows flavor? Here the assumption that the foreign locale is UTF-8 fails.
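The workaround described above can be wrapped up as a small helper. This is only a sketch; `withForeignEncoding` is a hypothetical helper name, and it assumes the "CP1251" encoding name is available via iconv or the Windows codepage machinery on the system in question.

```haskell
import Control.Exception (bracket)
import GHC.IO.Encoding (getForeignEncoding, mkTextEncoding, setForeignEncoding)

-- Hypothetical helper: run an action with the given foreign encoding,
-- restoring the previous encoding afterwards (even on exceptions).
withForeignEncoding :: String -> IO a -> IO a
withForeignEncoding name act =
  bracket
    (do old <- getForeignEncoding
        setForeignEncoding =<< mkTextEncoding name
        return old)
    setForeignEncoding
    (const act)

main :: IO ()
main = withForeignEncoding "CP1251" (putStrLn "кириллица 3")
```

With `error "кириллица 3"` in place of the `putStrLn`, this reproduces the reporter's observation that the message is then displayed correctly.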
Cheers!

Assignee: Ben Gamari

# Issue #12710: Make some reserved Unicode symbols "specials"
https://gitlab.haskell.org/ghc/ghc/-/issues/12710
Reported by Josh Price

The following comment is in `compiler/parser/Lexer.x`:
> ToDo: ideally, → and ∷ should be "specials", so that they cannot
> form part of a large operator. This would let us have a better
> syntax for kinds: ɑ∷\*→\* would be a legal kind signature. (maybe).
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ----------------- |
| Version | 8.0.1 |
| Type | FeatureRequest |
| TypeOfFailure | OtherFailure |
| Priority | low |
| Resolution | Unresolved |
| Component | Compiler (Parser) |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
Assignee: Josh Price

# Issue #11609: Document unicode report deviations
https://gitlab.haskell.org/ghc/ghc/-/issues/11609
Reported by Thomas Miedema

@nomeata mentions in #10196:
The report specifies “Haskell compilers are expected to make use of new versions of Unicode as they are made available.” So if we deviate from that, we should make sure that
- the user’s guide explicitly lists all deviations from the report [in this section](https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/bugs-and-infelicities.html#infelicities-lexical), and
- that the Haskell prime committee is going to be aware of these (sensible) deviations, so that they can become official.
Some known deviations are (there might be more):
- `OtherLetter` are treated as lowercase (#1103), and thus allowed in identifiers.
- `ModifierLetter` (#10196), `OtherNumber` (#4373) and `NonSpacingMark` (#7650) are allowed in identifiers, but only starting from the second character.
- `$decdigit = $ascdigit -- for now, should really be $digit (ToDo)` (see compiler/parser/Lexer.x)

# Issue #8730: Invalid Unicode Codepoints in Char
https://gitlab.haskell.org/ghc/ghc/-/issues/8730
Reported by mdmenzel

The surrogate range in Unicode is supposed to (as of Unicode 2.0, 1996) be a range of invalid code points, yet Data.Char allows the use of values in this range (in fact, it even gives them their own GeneralCategory).
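The described behavior is easy to check from code; a small demonstration (my own illustration, not from the report):

```haskell
import Data.Char (GeneralCategory (Surrogate), chr, generalCategory, ord)

main :: IO ()
main = do
  -- U+D800..U+DFFF is the UTF-16 surrogate range; Char represents it anyway,
  -- and generalCategory even reports the dedicated Surrogate category.
  print (generalCategory (chr 0xD800))  -- Surrogate
  print (generalCategory (chr 0xDFFF))  -- Surrogate
  print (ord (maxBound :: Char))        -- 1114111 (0x10FFFF)
```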
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 7.6.3 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | low |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
Assignee: Edward Kmett

# Issue #8524: GHC is inconsistent with the Haskell Report on which Unicode characters are allowed in string and character literals
https://gitlab.haskell.org/ghc/ghc/-/issues/8524
Reported by oerjan@nvg.ntnu.no
Examples from ghci 7.6.3 (als...GHC is inconsistent with the Haskell Report on which Unicode characters are allowed in string and character literals. (And I don't like either option, why leave out any characters in strings unnecessarily?)
Examples from ghci 7.6.3 (also tested in lambdabot on irc):
```
Prelude> "" -- Unicode char \8203, Format class.
<interactive>:10:2:
lexical error in string/character literal at character '\8203'
Prelude> " " -- Unicode char \8202, Space class.
"\8202"
Prelude> "t\ \est" -- Unicode char \8202 in a string gap.
<interactive>:14:4:
lexical error in string/character literal at character '\8202'
```
My reading of http://www.haskell.org/onlinereport/haskell2010/haskellch2.html
(section 2.2 and 2.6):
- The report BNF token "graphic", which can be used in literals, includes indirectly many Unicode classes, but uniWhite is not one of them. Thus the only Unicode whitespace allowed to represent itself in literals is ASCII space.
- Unicode formatting characters are not mentioned in the BNF that I can see, so are not allowed in literals.
- String gaps are made out of the report BNF token whitespace, which *does* include uniWhite.
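For contrast, here is the ASCII-whitespace case that works today (my own example): a string gap — backslash, whitespace, backslash — is discarded entirely, so the disagreement above is only about which whitespace characters are allowed inside the gap.

```haskell
-- A string gap with plain ASCII spaces and a newline; the gap is discarded.
greeting :: String
greeting = "Hello, \
           \world!"

main :: IO ()
main = putStrLn greeting  -- Hello, world!
```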
Who wants what:
|                              | GHC | Report | Me    |
| ---------------------------- | --- | ------ | ----- |
| Format in string             | No  | No     | Yes   |
| Space/uniWhite in string     | Yes | No     | Yes   |
| Space/uniWhite in string gap | No  | Yes    | Dunno |
In short, GHC's behavior is buggy and/or annoying in two opposite ways:
- It leaves out some Unicode characters as allowable in strings and character literals, presumably because the report says so.
- It allows some characters the report says it *shouldn't*, and refuses some characters the report says it *should*.
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | ------------ |
| Version | 7.6.3 |
| Type | Bug |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
# Issue #5108: Allow unicode sub/superscript symbols in operators
https://gitlab.haskell.org/ghc/ghc/-/issues/5108
Reported by mikhail.vorozhtsov

While #4373 permits
Identifiers with non-numeric subscripts are not accepted eith...While #4373 permits
```
Prelude> let v₁ = 1
```
the following is rejected
```
Prelude> let m >>=₁ f = undefined
<interactive>:0:10: lexical error at character '\8321'
```
Identifiers with non-numeric subscripts are not accepted either:
```
Prelude> let vₐ = 1
<interactive>:0:6: lexical error at character '\8336'
```
I wrote a small patch that makes such definitions possible.
1. A new unicode Alex macro, `$subsup`, is introduced and added to `$idchar`, `$symchar`, and `$graphic`.
2. A unicode code point is classified as `$subsup` by `alexGetChar` iff either of the following holds:
   a. The code point is annotated with \<sub\> or \<super\> in [http://www.unicode.org/Public/UNIDATA/UnicodeData.txt](http://www.unicode.org/Public/UNIDATA/UnicodeData.txt)
   b. It is the \[DOUBLE/TRIPLE/QUADRUPLE\] PRIME (U+2032, U+2033, U+2034, U+2057)
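For comparison, the digit-subscript identifier case from #4373 already compiles with stock GHC (my own check), while the operator and non-numeric-subscript forms shown above still fail:

```haskell
-- Accepted today: a subscript digit after the first character of an identifier.
v₁ :: Int
v₁ = 1

main :: IO ()
main = print v₁  -- 1
```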
<details><summary>Trac metadata</summary>
| Trac field | Value |
| ---------------------- | -------------- |
| Version | 7.1 |
| Type | FeatureRequest |
| TypeOfFailure | OtherFailure |
| Priority | normal |
| Resolution | Unresolved |
| Component | Compiler |
| Test case | |
| Differential revisions | |
| BlockedBy | |
| Related | |
| Blocking | |
| CC | |
| Operating system | |
| Architecture | |
</details>
Milestone: 8.0.1