BNFC/bnfc

Parse quoted token like String

ScottFreeCode opened this issue · 8 comments

I am having trouble figuring out how I can write a token like the built-in String but with different quotation marks and, in my Abs.hs result, only get the content inside the quotes like String does. If I create a token with some kind of quotes it seems that whatever interprets the Abs members has to remove them, whereas this is not necessary for String as far as I've noticed. Is there a way to achieve this?

My first thought was to define the token as just the content inside the quotes and then use it in a rule such as MyQuotedType . MyQuotedType ::= "'" MyToken "'"; However, this causes other keywords to be parsed (or perhaps lexed? I'm not terribly familiar with the distinction) as MyToken even when they are not preceded by the quote. That breaks code (it no longer parses at all) that parsed successfully in the version where the quotes are part of the token and are removed manually in the interpreter.

(I realize ' is used by Char, but I don't happen to be using Char. And, I could always change it to backticks or something if I ended up needing to. Some languages use / as quotes for regular expressions. Ideally I'd be able to specify the quotation marks to use/remove.)

Relatedly, does the content of strings need to be unescaped (e.g. \" -> " and \\ -> \), or does the parser also handle that? If it does, could the parser handle it similarly for custom quoted tokens?

I am having trouble figuring out how I can write a token like the built-in String but with different quotation marks and, in my Abs.hs result, only get the content inside the quotes like String does.

This is unfortunately not possible with BNFC. The token types Char, Double, Integer, String are hard-wired and do something special.
All user-defined token types are represented as strings in the abstract syntax, and each of these strings contains the whole text that matched the respective regular expression, delimiters included.
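
Since a user-defined token arrives with its delimiters and escapes intact, the removal has to happen in whatever consumes the abstract syntax. A minimal sketch of such a step follows, assuming the token text has been extracted as a plain String. The name unquote is hypothetical and not part of BNFC or its generated code, and the escape set simply mirrors the one handled for String in the generated lexer.

```haskell
-- Hypothetical helper, not part of BNFC or its generated code.
-- Strips one leading and one trailing delimiter 'q' and resolves
-- backslash escapes. Returns Nothing for malformed input.
unquote :: Char -> String -> Maybe String
unquote q s = case s of
  (c : rest) | c == q -> go rest
  _                   -> Nothing
  where
    go cs = case cs of
      [c] | c == q      -> Just []                     -- closing delimiter
      '\\' : c : rest   -> (unescape c :) <$> go rest  -- escaped character
      c : rest | c /= q -> (c :) <$> go rest           -- ordinary character
      _                 -> Nothing                     -- unterminated or stray delimiter

    unescape c = case c of
      'n' -> '\n'
      't' -> '\t'
      'r' -> '\r'
      'f' -> '\f'
      _   -> c  -- an escaped delimiter or backslash maps to itself
```

With this sketch, unquote '\'' "'thing'" evaluates to Just "thing", and input that lacks the closing delimiter evaluates to Nothing.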

Thanks for the clarification @andreasabel !

Do you think it would be difficult to add a directive to the grammar such as quoted '<open>' '<close>' "<escapes>" token …? Or, more flexibly, a hook to embed Haskell postprocessing (ideally even allowing type conversion, but at least String -> String/Text -> Text) in the parsing of a given type or constructor, without having to manually modify the generated code?

A workaround would be that you record a patch that you could apply each time after BNFC has run (via patching the Makefile).

a hook to embed Haskell postprocessing (ideally even allow type conversion, but at least String -> String/Text -> Text) in the parsing of a given type or constructor

One design for this would be #267, but I welcome spinning more design ideas!

I tried patching the lexer like so based on the string handling:

diff --git a/src/MyPackage/Lex.x b/src/MyPackage/Lex.x
index 1234567..1234567 100644
--- a/src/MyPackage/Lex.x
+++ b/src/MyPackage/Lex.x
@@ -25,19 +25,19 @@ $u = [. \n]          -- universal: any character
    \, | \[ | \] | \{ | \} | \: | \[ \]
 
 :-
 
 
 $white+ ;
 @rsyms
     { tok (\p s -> PT p (eitherResIdent TV s)) }
 \' ([$u # [\' \\]] | \\ [\' \\ f n r t]) * \'
-    { tok (\p s -> PT p (eitherResIdent T_MyToken s)) }
+    { tok (\p s -> PT p (eitherResIdent T_MyToken $ unescapeInitTail s)) }
 
 $l $i*
     { tok (\p s -> PT p (eitherResIdent TV s)) }
 \" ([$u # [\" \\ \n]] | (\\ (\" | \\ | \' | n | t | r | f)))* \"
     { tok (\p s -> PT p (TL $ unescapeInitTail s)) }
 
 $d+
     { tok (\p s -> PT p (TI s))    }
 $d+ \. $d+ (e (\-)? $d+)?
@@ -117,18 +117,19 @@ unescapeInitTail :: Data.Text.Text -> Data.Text.Text
 unescapeInitTail = Data.Text.pack . unesc . tail . Data.Text.unpack
   where
   unesc s = case s of
     '\\':c:cs | elem c ['\"', '\\', '\''] -> c : unesc cs
     '\\':'n':cs  -> '\n' : unesc cs
     '\\':'t':cs  -> '\t' : unesc cs
     '\\':'r':cs  -> '\r' : unesc cs
     '\\':'f':cs  -> '\f' : unesc cs
     '"':[]    -> []
+    '\'':[]   -> []
     c:cs      -> c : unesc cs
     _         -> []
 
 -------------------------------------------------------------------
 -- Alex wrapper code.
 -- A modified "posn" wrapper.
 -------------------------------------------------------------------
 
 data Posn = Pn !Int !Int !Int

This mostly works. However, without the patch I am able to use a grammar that can match two forms:

x y

OR

x 'thing'

And 'thing' can be 'y'.

With the patch, the one case that now fails is that x 'y' is interpreted the same as x y.

I'm guessing I didn't correctly modify the lexer to do what strings are doing.

Any ideas?

(I can rig up a minimal reproducible test if that would help.)

Ah, never mind. I took a second look at the code, saw the definition of eitherResIdent, and realized it was overriding 'y' with y whenever the unquoted contents match an identifier. I changed + { tok (\p s -> PT p (eitherResIdent T_MyToken $ unescapeInitTail s)) } to + { tok (\p s -> PT p (T_MyToken $ unescapeInitTail s)) }
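
The effect described here can be reproduced with a simplified model. This is only a sketch: the Tok type, the reservedWords list, and the name eitherResIdentModel are illustrative stand-ins rather than the generated definitions, and it assumes y is a word the lexer treats as reserved in this grammar.

```haskell
-- Simplified stand-ins for the generated token type; names are illustrative.
data Tok
  = TS String         -- reserved word or symbol
  | T_MyToken String  -- user-defined quoted token
  deriving (Eq, Show)

-- Hypothetical reserved-word list (assumption: x and y are keywords here).
reservedWords :: [String]
reservedWords = ["x", "y"]

-- Model of the lookup: if the string is a reserved word, that wins;
-- otherwise the supplied constructor is applied.
eitherResIdentModel :: (String -> Tok) -> String -> Tok
eitherResIdentModel mk s
  | s `elem` reservedWords = TS s
  | otherwise              = mk s
```

Before the patch the lookup sees the string with its quotes, which matches no reserved word, so the token stays a T_MyToken. Once the quotes are stripped first, the lookup sees a bare y and returns the keyword token instead. Applying T_MyToken directly skips the lookup.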

Well, I guess I should probably leave these comments up here in case anyone else makes the same mistake. Correct patch:

diff --git a/src/MyPackage/Lex.x b/src/MyPackage/Lex.x
index 1234567..1234567 100644
--- a/src/MyPackage/Lex.x
+++ b/src/MyPackage/Lex.x
@@ -25,19 +25,19 @@ $u = [. \n]          -- universal: any character
    \, | \[ | \] | \{ | \} | \: | \[ \]
 
 :-
 
 
 $white+ ;
 @rsyms
     { tok (\p s -> PT p (eitherResIdent TV s)) }
 \' ([$u # [\' \\]] | \\ [\' \\ f n r t]) * \'
-    { tok (\p s -> PT p (eitherResIdent T_MyToken s)) }
+    { tok (\p s -> PT p (T_MyToken $ unescapeInitTail s)) }
 
 $l $i*
     { tok (\p s -> PT p (eitherResIdent TV s)) }
 \" ([$u # [\" \\ \n]] | (\\ (\" | \\ | \' | n | t | r | f)))* \"
     { tok (\p s -> PT p (TL $ unescapeInitTail s)) }
 
 $d+
     { tok (\p s -> PT p (TI s))    }
 $d+ \. $d+ (e (\-)? $d+)?
@@ -117,18 +117,19 @@ unescapeInitTail :: Data.Text.Text -> Data.Text.Text
 unescapeInitTail = Data.Text.pack . unesc . tail . Data.Text.unpack
   where
   unesc s = case s of
     '\\':c:cs | elem c ['\"', '\\', '\''] -> c : unesc cs
     '\\':'n':cs  -> '\n' : unesc cs
     '\\':'t':cs  -> '\t' : unesc cs
     '\\':'r':cs  -> '\r' : unesc cs
     '\\':'f':cs  -> '\f' : unesc cs
     '"':[]    -> []
+    '\'':[]   -> []
     c:cs      -> c : unesc cs
     _         -> []
 
 -------------------------------------------------------------------
 -- Alex wrapper code.
 -- A modified "posn" wrapper.
 -------------------------------------------------------------------
 
 data Posn = Pn !Int !Int !Int

Judging from playing with the Test program, the printer might also need adjusting to restore the quotes around MyToken. I'll have to take a look at that at whatever point I need the printer.
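
For the printer side, one possible shape of the adjustment is a function that re-escapes the content and wraps it in the delimiter again. This is a sketch under assumptions: quoteWith is a hypothetical name, it works on String rather than Text, and where it would be called from in the generated printer is not shown because that depends on the generated code.

```haskell
-- Hypothetical helper for a hand-patched printer; not generated by BNFC.
-- Escapes the delimiter, backslashes, and the control characters that the
-- lexer patch unescapes, then wraps the result in the delimiter 'q'.
quoteWith :: Char -> String -> String
quoteWith q s = q : concatMap escape s ++ [q]
  where
    escape c
      | c == q    = ['\\', c]
      | otherwise = case c of
          '\\' -> "\\\\"
          '\n' -> "\\n"
          '\t' -> "\\t"
          '\r' -> "\\r"
          '\f' -> "\\f"
          _    -> [c]
```

For example, quoteWith '\'' "thing" evaluates to "'thing'", which is the form the patched lexer rule accepts.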