haskell-suite/haskell-src-exts

Unable to parse legal UTF-8 function names

fosskers opened this issue · 3 comments

I decided to venture into no-man's land and define a type whose fields had non-ASCII names:

data Entry = Entry { 漢字 :: Kanji
                   , 部首 :: Kanji
                   , 親 :: Kanji }

This compiles fine, but unfortunately haskell-src-exts (and therefore stylish-haskell) is unable to process it:

ParseFailed (SrcLoc "<unknown>.hs" 31 22) "Illegal character ''\\28450''\n"

which is produced by this guard branch in InternalLexer.hs:

| otherwise -> do
    discard 1
    case c of
        -- First the special symbols
        '('  -> return LeftParen
        ')'  -> return RightParen
        ','  -> return Comma
        ';'  -> return SemiColon
        '['  -> return LeftSquare
        ']'  -> return RightSquare
        '`'  -> return BackQuote
        '{'  -> do
            pushContextL NoLayout
            return LeftCurly
        '}'  -> do
            popContextL "lexStdToken"
            return RightCurly
        '\'' -> lexCharacter
        '"'  -> lexString
        _    -> fail ("Illegal character \'" ++ show c ++ "\'\n")
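As a side note on the message itself, the doubled quotes in the error appear to come from show, which already wraps a Char in single quotes and escapes non-ASCII characters as decimal code points. A minimal sketch; illegalCharMessage is a hypothetical name that only mirrors the string built in the branch above, not a function in haskell-src-exts:

```haskell
-- Hypothetical helper: builds the same string as the fall-through case.
illegalCharMessage :: Char -> String
illegalCharMessage c = "Illegal character \'" ++ show c ++ "\'\n"

main :: IO ()
main = do
  -- show on a Char adds its own quotes and escapes non-ASCII characters.
  putStrLn (show '漢')
  -- The lexer's message therefore ends up with two layers of quotes.
  putStr (illegalCharMessage '漢')
```

So 28450 is just the decimal code point of 漢 (U+6F22), the first character of the first field name.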

Yes, I'm evil for using UTF-8 field names, but we should probably still be able to parse them, since they're legal as far as GHC is concerned.

Thoughts? Thanks for your ongoing efforts.

I'll merge a patch that fixes it, but these days I'm not going to spend any more time fixing problems with the parser myself.

Understood, thank you.

EDIT: got it wrong the first time.


Isn't the actual problem that it falls through to that otherwise branch on L808 in the first place? It should have matched one of the earlier cases on L778, but isUpper and isLower are both False for caseless letters such as 漢:

c:_ | isDigit c -> lexDecimalOrFloat
    | isUpper c -> lexConIdOrQual ""
    | isLower c || c == '_' -> do

Also in need of fixing here:

isIdent c = isAlphaNum c || c == '\'' || c == '_'
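To make the fall-through concrete, here is a small sketch using only Data.Char. It checks how 漢 is classified and tries one possible guard that treats caseless letters (general category OtherLetter) like lowercase ones. The name isVarStart is hypothetical and not part of haskell-src-exts, and the idea that OtherLetter should count as lowercase is an assumption based on GHC accepting the declaration above.

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isAlpha, isLower, isUpper)

-- Hypothetical replacement for the `isLower c || c == '_'` guard:
-- also accepts caseless letters (e.g. CJK) as the start of a
-- variable or field name. This is an assumption, not the library's code.
isVarStart :: Char -> Bool
isVarStart c = isLower c || c == '_' || generalCategory c == OtherLetter

main :: IO ()
main = do
  let c = '漢'
  -- A caseless letter is alphabetic but neither upper- nor lowercase,
  -- which is why none of the quoted guards match it.
  print (isUpper c, isLower c, isAlpha c)
  print (generalCategory c)
  -- The widened guard accepts it while still rejecting uppercase letters.
  print (isVarStart c, isVarStart 'x', isVarStart 'X')
```

Whether a real patch should also cover other categories, such as modifier letters or combining marks, is left open here.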