On Windows, out-of-codepage characters can cause GHC builds to fail
You can see where this hit us recently on stack with issues 738 and 734. To demonstrate, I'm attaching a UTF-8 encoded Haskell program that contains some Hebrew characters and triggers a warning under -Wall. The contents of that file are:
    module Main
        ( main
        , שלום
        ) where

    main :: IO ()
    main = putStrLn שלום

    שלום = "shalom"
If I first set my codepage to 65001 (UTF-8), everything works as expected:
    C:\Users\Michael\Desktop>chcp 65001
    Active code page: 65001
    C:\Users\Michael\Desktop>ghc -fforce-recomp -Wall -ddump-hi -ddump-to-file shalom.hs
    [1 of 1] Compiling Main ( shalom.hs, shalom.o )
    shalom.hs:9:1: Warning:
        Top-level binding with no type signature: שלום :: [Char]
    Linking shalom.exe ...
However, if I set my codepage to 437 (US), both the warnings sent to the console, and the .hi dump file, cause GHC to exit prematurely:
    C:\Users\Michael\Desktop>chcp 437
    Active code page: 437
    C:\Users\Michael\Desktop>ghc -fforce-recomp -Wall shalom.hs
    [1 of 1] Compiling Main ( shalom.hs, shalom.o )
    shalom.hs:9:1: Warning:
        Top-level binding with no type signature: <stderr>: commitBuffer: invalid argument (invalid character)
    C:\Users\Michael\Desktop>chcp 437
    Active code page: 437
    C:\Users\Michael\Desktop>ghc -fforce-recomp -ddump-hi -ddump-to-file shalom.hs
    [1 of 1] Compiling Main ( shalom.hs, shalom.o )
    shalom.dump-hi: commitBuffer: invalid argument (invalid character)
At the very least, I would argue that -ddump-to-file should always dump to the output files as UTF-8, as this is the most useful for tooling. Beyond that, there are a few options here:
- Have all output, including output to the console, go out as UTF-8. This may not play terribly nicely with consoles unless the output codepage is also set.
- Provide a command line option or environment variable to specify "output as UTF-8."
- More radical: change the default way that all Handles work so that UTF-8 is the default, instead of paying attention to code pages and environment variables. Honestly, this is my preference, but it's a bigger discussion than this one bug.
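To make the first two options concrete, here is a minimal sketch of forcing UTF-8 on individual Handles with `hSetEncoding` from `System.IO`, so that neither a dump file nor the console goes through the active code page. The helper names `writeFileUtf8` and `readFileUtf8` and the file name `shalom.example-dump` are made up for illustration; this is not how GHC itself writes its dump files.

```haskell
import System.IO

-- | Hypothetical helper: write a String to a file as UTF-8, ignoring the
-- locale encoding (on Windows, the active code page).
writeFileUtf8 :: FilePath -> String -> IO ()
writeFileUtf8 path contents =
  withFile path WriteMode $ \h -> do
    hSetEncoding h utf8
    hPutStr h contents

-- | Hypothetical helper: read a file back, decoding it as UTF-8.
readFileUtf8 :: FilePath -> IO String
readFileUtf8 path =
  withFile path ReadMode $ \h -> do
    hSetEncoding h utf8
    contents <- hGetContents h
    -- Force the lazy contents before withFile closes the handle.
    length contents `seq` return contents

main :: IO ()
main = do
  -- Console output as UTF-8, regardless of the code page. Under code page 437
  -- the Hebrew text may render as garbage, but the write itself should not fail.
  hSetEncoding stdout utf8
  hSetEncoding stderr utf8
  let warning = "Top-level binding with no type signature: שלום :: [Char]"
  hPutStrLn stderr warning
  -- File output as UTF-8 (the file name is an assumption for this example).
  writeFileUtf8 "shalom.example-dump" (warning ++ "\n")
  roundTrip <- readFileUtf8 "shalom.example-dump"
  putStrLn (if roundTrip == warning ++ "\n" then "round trip ok" else "round trip mismatch")
```

Setting the encoding per Handle keeps the change local to the process, which is the main difference from changing the console code page.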
The workaround we've implemented in stack for now is setting the codepage to 65001 for the console while running stack. This is not ideal, since this is essentially a global setting for the entire console.
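The shape of that workaround is save, set, run, restore. The sketch below shows the pattern with `bracket` from `Control.Exception`, using an `IORef` as a stand-in for the console code page so that it runs on any platform. `withTemporary` is a hypothetical name, not stack's actual code; on Windows one would presumably pass `getConsoleOutputCP` and `setConsoleOutputCP` from the Win32 package's `System.Win32.Console` as the getter and setter, which is an assumption here.

```haskell
import Control.Exception (bracket)
import Data.IORef (newIORef, readIORef, writeIORef)

-- | Hypothetical helper: run an action with a setting temporarily changed,
-- restoring the previous value afterwards, even if the action throws.
withTemporary :: IO a -> (a -> IO ()) -> a -> IO b -> IO b
withTemporary getCurrent setCurrent temporary action =
  bracket
    (do old <- getCurrent        -- remember the current value
        setCurrent temporary     -- switch to the temporary value
        return old)
    setCurrent                   -- restore the remembered value
    (const action)

main :: IO ()
main = do
  -- Stand-in for the console code page (assumption: a real program would
  -- query and set the console instead of an IORef).
  codePage <- newIORef (437 :: Int)
  inside <- withTemporary (readIORef codePage) (writeIORef codePage) 65001
              (readIORef codePage)
  after <- readIORef codePage
  print (inside, after)  -- prints (65001,437)
```

Restoring the old value limits the damage, but while the action runs the setting is still visible to anything else attached to the same console.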