Lexer does not handle unicode numeric subscripts

Hi all,

I would fix this myself but the GHC Lexer looks rather fragile and I'd be afraid of breaking something. I can have a crack at it and write a patch if you like.

Currently GHC rejects perfectly good Unicode identifier characters, namely numeric subscripts. For example, the following expression:

let v₂ = (+) in v₂ 1 3

gives:

lexical error at character '\8322'
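
The code point in the error, 8322 decimal, is U+2082 (SUBSCRIPT TWO). As a quick sanity check outside the lexer, `Data.Char` from base can report how that character is classified. This is only a minimal sketch; `subscriptTwo` and `describe` are names made up for illustration:

```haskell
import Data.Char (generalCategory, isDigit, ord)
import Numeric (showHex)

-- The character from the error message: '\8322' is U+2082, SUBSCRIPT TWO.
subscriptTwo :: Char
subscriptTwo = '\8322'

-- Hypothetical helper: summarise how base classifies a character.
describe :: Char -> String
describe c =
  "U+" ++ showHex (ord c) ""
    ++ " category=" ++ show (generalCategory c)
    ++ " isDigit=" ++ show (isDigit c)
```

In GHCi, `putStrLn (describe subscriptTwo)` prints `U+2082 category=OtherNumber isDigit=False`.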

The subscripts are in the OtherNumber (No) Unicode general category, so I'm pretty sure the main change is to Lexer.x, changing:

   OtherNumber           -> other_graphic 

to some other category (in the definition of alexGetChar).

The main issue I see here is that we can't just change "other_graphic" to "digit": the subscripts would have to be treated like ' or _ rather than like digits, or it would become acceptable to use them as real numeric digits, which I don't think we want.
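
To make that distinction concrete, here is a toy classifier. It is not GHC's Lexer.x code, and every name in it (`CharClass`, `classify`, `isIdentifier`, `isNumericLiteral`) is hypothetical. It assumes only ASCII letters plus underscore may start an identifier, and it ignores keywords, constructors and the rest of the real grammar. The point is just that OtherNumber characters get a class that may continue an identifier but never forms a numeric literal:

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isAsciiLower, isAsciiUpper, isDigit)

-- Hypothetical character classes for this sketch only.
data CharClass
  = Letter    -- may start or continue an identifier
  | Digit     -- may continue an identifier and may form numeric literals
  | IdentTail -- may continue an identifier, but is never part of a number
  | Other
  deriving (Eq, Show)

classify :: Char -> CharClass
classify c
  | isAsciiLower c || isAsciiUpper c || c == '_' = Letter
  | isDigit c = Digit -- ASCII 0-9 only
  | c == '\'' = IdentTail
  | generalCategory c == OtherNumber = IdentTail -- e.g. the subscript in v₂
  | otherwise = Other

-- An identifier starts with a Letter and continues with anything but Other.
isIdentifier :: String -> Bool
isIdentifier [] = False
isIdentifier (c : cs) = classify c == Letter && all continues cs
  where
    continues x = classify x `elem` [Letter, Digit, IdentTail]

-- A numeric literal is a non-empty run of Digit characters.
isNumericLiteral :: String -> Bool
isNumericLiteral s = not (null s) && all ((== Digit) . classify) s
```

Under these assumptions `isIdentifier "v\8322"` is `True`, while `isNumericLiteral "1\8322"` and `isNumericLiteral "\8322"` are both `False`, which is the behaviour asked for above.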

Seeing as I am not confident enough in GHC's lexer/parser structure to make these changes, I was wondering whether anyone more experienced who has the time could do it.

Edited by Ian Lynagh