Converter from Coco/R to ANTLR4 grammar

coco_to_antlr

Convert your Coco/R attributed grammar to an ANTLR4 one. This tool automates most of the conversion; how much manual work remains depends on the complexity of your embedded instrumentation code. It will:

  • rewrite production rules to ANTLR parser productions
  • rewrite Token definitions to ANTLR lexer rules
  • rewrite PRAGMAS and COMMENTS definitions to ANTLR lexer rules that are sent to the hidden channel (-> channel(HIDDEN))
  • rewrite CharSet definitions to fragment lexer rules
  • copy embedded instrumentation code and (partially) adapt it to the ANTLR API
    • for C++, rewrite reference attributes to returns [...] clauses
    • replace references to the current/lookahead token (t, la) by $ctx (NOTE: these usually still need a manual rewrite)
    • correctly reference arguments of the current scope (e.g. an attribute declared as returns [int no] has to be assigned like $no = ...)
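To give an idea of the shape of the generated lexer rules, here is a minimal C++ sketch of how a non-nested Coco/R `COMMENTS FROM ... TO ...` definition could be turned into an ANTLR rule on the hidden channel. It is not taken from the converter's source: `antlrLiteral` and `commentRule` are hypothetical helpers, the rule name is supplied by the caller, and the sketch assumes delimiters made of printable characters (plus the special case of a comment that ends at a line break).

```cpp
#include <cassert>
#include <string>

// Quote a literal the ANTLR way: single quotes, with backslash and
// single quote escaped. (Hypothetical helper, not part of the tool.)
std::string antlrLiteral(const std::string& text) {
    std::string out = "'";
    for (char c : text) {
        if (c == '\\' || c == '\'') out += '\\';
        out += c;
    }
    out += "'";
    return out;
}

// Build a hidden-channel lexer rule for a non-nested comment definition.
// `from` and `to` are the unquoted delimiters of the Coco/R definition.
std::string commentRule(const std::string& name,
                        const std::string& from,
                        const std::string& to) {
    assert(!name.empty() && !from.empty() && !to.empty());
    if (to == "\n") {
        // A comment running to the end of the line: match everything
        // except line breaks instead of emitting a raw newline literal.
        return name + " : " + antlrLiteral(from) +
               " ~[\\r\\n]* -> channel(HIDDEN) ;";
    }
    // Block comment: non-greedy match up to the closing delimiter.
    return name + " : " + antlrLiteral(from) + " .*? " +
           antlrLiteral(to) + " -> channel(HIDDEN) ;";
}
```

With these helpers, `commentRule("COMMENT_1", "/*", "*/")` yields `COMMENT_1 : '/*' .*? '*/' -> channel(HIDDEN) ;`. The non-greedy `.*?` is what keeps the rule from swallowing everything up to the last closing delimiter in the file.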

TODO

See also the various TODO/FIXME comments in the source code; this list is only an excerpt from them.

  • ANY in Coco production rules has no equivalent in ANTLR

  • character set definitions using set union & difference are not correctly rewritten

  • IF(...) conflict resolvers work differently in ANTLR and have to be checked manually (they are preserved as comments)

  • comments are not preserved since version 67a4266 (when support for rewriting rule return attributes was implemented). The best way to preserve comments currently is to run both version 784f8fd and the current master and merge the results (git rebase is your friend).

  • name collisions of generated rule & attribute names with language keywords and/or ANTLR syntax are not checked (e.g. a rule char or instrumentation code using a reference to a rule text which collides with $text, see antlr/antlr4#2266).

  • NESTED comment definitions are not translated (though they should be possible with recursive lexer rules)

  • using an external lexer (i.e. only listing token names, not defining them) is not supported

  • string/character escapes are partly converted, need manual checking

  • Tokens with CONTEXT specification are not supported

  • Test with instrumentation languages other than C++ (see e.g. antlr/antlr4#2265)

  • Test on Windows; only Linux has been tested so far

  • Rewrite Token & CharSet definitions with AST & Visitor, too, like productions already do

  • Optimize code generation from the AST (i.e. remove unneeded parentheses, especially for quantifiers around a single subrule, like ('TOKEN')?)

  • Parsing of (C++) attributes in rule definition and invocation is quite fragile

  • A literal $ in instrumentation code causes ANTLR to fail
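As an illustration of the character set item in the list above: ANTLR lexer sets have no difference operator, so one possible approach is to evaluate Coco/R's `+` and `-` on concrete sets and emit the result as a single `[...]` set. The following C++ sketch shows that idea under stated assumptions. It is not how the converter currently works: all names (`CharSet`, `range`, `unite`, `subtract`, `toFragment`) are hypothetical, and only printable ASCII characters are handled.

```cpp
#include <cassert>
#include <set>
#include <string>

using CharSet = std::set<char>;

// Inclusive character range, e.g. range('a', 'z') for Coco/R 'a'..'z'.
CharSet range(char lo, char hi) {
    assert(lo <= hi);
    CharSet s;
    for (int c = lo; c <= hi; ++c) s.insert(static_cast<char>(c));
    return s;
}

// Coco/R "a + b" (set union).
CharSet unite(CharSet a, const CharSet& b) {
    a.insert(b.begin(), b.end());
    return a;
}

// Coco/R "a - b" (set difference).
CharSet subtract(CharSet a, const CharSet& b) {
    for (char c : b) a.erase(c);
    return a;
}

// Escape the characters that are special inside an ANTLR [...] set.
std::string escapeInSet(char c) {
    if (c == ']' || c == '\\' || c == '-') return std::string("\\") + c;
    return std::string(1, c);
}

// Emit the set as a fragment rule, merging runs of consecutive
// characters into ranges. Assumes printable ASCII members only.
std::string toFragment(const std::string& name, const CharSet& s) {
    assert(!name.empty() && !s.empty());
    std::string body;
    auto it = s.begin();
    while (it != s.end()) {
        char first = *it, last = *it;
        ++it;
        while (it != s.end() && *it == last + 1) { last = *it; ++it; }
        body += escapeInSet(first);
        if (last != first) body += "-" + escapeInSet(last);
    }
    return "fragment " + name + " : [" + body + "] ;";
}
```

For a Coco/R definition such as `letter = 'a'..'z' + 'A'..'Z' - 'q'.`, calling `toFragment("Letter", subtract(unite(range('a','z'), range('A','Z')), range('q','q')))` produces `fragment Letter : [A-Za-pr-z] ;`. Flattening to an explicit set avoids having to express the difference in ANTLR at all, at the cost of losing the original structure of the definition.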