UTF-8 decoding doesn't work correctly on GHC 9.2.1+AArch64 NCG

Summary

With GHC 9.2.1, GHC.IO.Encoding.utf8 fails to decode non-ASCII text.

It happens on

  • arm64-darwin / AArch64 NCG / ghc-9.2 branch
  • aarch64-linux / AArch64 NCG / GHC 9.2.1

but does not happen on

  • x86_64
  • arm64-darwin / LLVM backend / GHC 8.10.7
  • arm64-darwin / base library built with LLVM backend (hadrian/build --flavour=default+llvm) / ghc-9.2 branch
  • arm64-darwin / AArch64 NCG / master branch

So I suspect this is caused by a bug in the AArch64 NCG whose fix has not been backported to the ghc-9.2 branch.

Steps to reproduce

The following program

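import GHC.Foreign
import GHC.IO.Encoding
import qualified Foreign.C.String as F

main :: IO ()
main = withCStringLen utf8 "\x1F424" $ \csl -> do
  -- Read the buffer back byte by byte (each byte becomes one Char).
  s <- F.peekCAStringLen csl
  print s
  -- Decode the same buffer as UTF-8; this is the step that fails.
  s <- peekCStringLen utf8 csl
  print s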

prints

"\240\159\144\164"
string.hs: recoverDecode: invalid argument (invalid byte sequence)

Expected behavior

It should output

"\240\159\144\164"
"\128036"
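
For reference, the first line of output can be checked by hand: U+1F424 (decimal 128036) needs the four-byte UTF-8 form, so the encoded buffer itself looks correct and only the decoding step misbehaves. The following is a minimal sketch of that arithmetic; `utf8Bytes` is a hypothetical helper written for this illustration, not a function from base, and it does not reject surrogates.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical helper (not part of base): encode one code point as UTF-8.
-- Surrogate code points are not rejected, to keep the sketch short.
utf8Bytes :: Char -> [Word8]
utf8Bytes c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. top 6, cont 0]
  | n < 0x10000 = [0xE0 .|. top 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. top 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Payload bits for the leading byte.
    top s = fromIntegral (n `shiftR` s)
    -- Continuation byte: 10xxxxxx carrying six payload bits.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

main :: IO ()
main = print (utf8Bytes '\x1F424')  -- prints [240,159,144,164]
```

These four bytes match the "\240\159\144\164" printed by the reproduction, so a correct decoder should turn them back into the single character "\128036".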

Environment

  • GHC version used: 9.2.1 release / ghc-9.2 branch (ac4496ce)
  • Operating System: macOS
  • System Architecture: AArch64