Engelberg/instaparse

Is it possible to use a G4 grammar from Instaparse? (Clojure grammar)

timothypratley opened this issue · 4 comments

I think G4 is the ANTLR 4 grammar format. I'm not sure why they prefer that format, but it does not seem to be natively compatible with Instaparse. At least, when I tried loading this grammar:
https://github.com/antlr/grammars-v4/blob/master/clojure/Clojure.g4
I got an error.

Do I have to translate G4 to EBNF?

Are there any tools or examples I could draw on here?

My apologies if there is a better forum to ask this question in!

I am not familiar with G4, so unfortunately I don't have any advice to offer on how to do the translation.

No worries! On closer inspection, I think the only differences are:

  • G4 writes comments as /* multi-line */ and // single line.
  • fragment: I'm not sure exactly what this is, but it seems to be a way to define part of a rule for reuse.

So I suspect translating them is pretty easy. I'll report back with more details if I can get it working.
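The two differences listed above could be handled with a small preprocessing pass before the remaining rules are adjusted by hand. The following is a minimal Python sketch under stated assumptions, not an existing tool: `strip_g4_noise` is a hypothetical helper name, and it assumes the grammar never contains `//` or `/*` inside a quoted literal. In ANTLR 4, `fragment` marks a lexer helper rule that other lexer rules can reference but that does not produce a token by itself, so dropping the keyword and treating the rule as an ordinary one is only an approximation.

```python
import re


def strip_g4_noise(g4_text: str) -> str:
    """Remove G4 block comments, line comments, and the `fragment` keyword.

    Hypothetical helper: a naive text-level pass that does not understand
    quoted literals, so a literal containing `//` or `/*` would be damaged.
    """
    # Drop /* ... */ comments, including ones that span several lines.
    text = re.sub(r"/\*.*?\*/", "", g4_text, flags=re.DOTALL)
    # Drop // comments up to the end of the line.
    text = re.sub(r"//[^\n]*", "", text)
    # Drop a leading `fragment` keyword so the rule reads like any other rule.
    text = re.sub(r"^[ \t]*fragment\s+", "", text, flags=re.MULTILINE)
    return text


sample = """/* Lexers
   follow */
fragment
HEXD: [0-9a-fA-F] ; // one hex digit
file: form * ;
"""

print(strip_g4_noise(sample))
```

The output still needs manual work (ANTLR character ranges, the `~` negation operator, and whitespace handling are not touched), so this only removes the parts that Instaparse would reject outright.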

Just for reference, this is what I came up with:

file: form * ;

<form>: literal | list | vector | map | reader_macro;

<forms>: form * ;

list: <'('> forms <')'> ;

vector: <'['> forms <']'> ;

map: <'{'> (form form)* <'}'> ;

set: <'#{'> forms <'}'> ;

reader_macro
    : lambda
    | meta_data
    | regex
    | var_quote
    | host_expr
    | set
    | tag
    | discard
    | dispatch
    | deref
    | quote
    | backtick
    | unquote
    | unquote_splicing
    | gensym
    ;

quote: <'\''> form ;

backtick: <'`'> form ;

unquote: <'~'> form ;

unquote_splicing: <'~@'> form ;

tag: <'^'> form form ;

deref: <'@'> form ;

gensym: SYMBOL <'#'> ;

lambda: <'#('> form* <')'> ;

meta_data: <'#^'> (map form | form) ;

var_quote: <'#\''> symbol ;

host_expr: <'#+'> form form ;

discard: <'#_'> form ;

dispatch: <'#'> symbol form ;

regex: <'#'> string ;

literal: string | number | character | nil | BOOLEAN | keyword | symbol | param_name ;

string: STRING;
hex: HEX;
bin: BIN;
bign: BIGN;
number: FLOAT | hex | bin | bign | LONG ;

character : named_char | u_hex_quad | any_char ;
named_char: CHAR_NAMED ;
any_char: CHAR_ANY ;
u_hex_quad: CHAR_U ;

nil: NIL;

keyword: macro_keyword | simple_keyword;
<simple_keyword>: ':' symbol;
<macro_keyword>: ':' ':' symbol;

symbol: ns_symbol | simple_sym;
<simple_sym>: SYMBOL;
<ns_symbol>: NS_SYMBOL;

param_name: PARAM_NAME;

<STRING> : <'"'> #"([^\"\\]|\\.)*" <'"'>;

<FLOAT>
    : '-'? #"[0-9]+" FLOAT_TAIL
    | '-'? 'Infinity'
    | '-'? 'NaN'
    ;

<FLOAT_TAIL>: FLOAT_DECIMAL FLOAT_EXP | FLOAT_DECIMAL | FLOAT_EXP ;

<FLOAT_DECIMAL>: '.' #"[0-9]+" ;

<FLOAT_EXP>: #"[eE]" '-'? #"[0-9]+" ;
<HEXD>: #"[0-9a-fA-F]" ;
<HEX>: '0' #"[xX]" HEXD+ ;
<BIN>: '0' #"[bB][10]+" ;
<LONG>: '-'? #"[0-9]+[lL]?";
<BIGN>: '-'? #"[0-9]+[nN]";

<CHAR_U> : '\\' 'u'#"[0-9D-Fd-f]" HEXD HEXD HEXD ;

<CHAR_NAMED>: '\\' ( 'newline' | 'return' | 'space' | 'tab' | 'formfeed' | 'backspace' ) ;

<CHAR_ANY>: '\\' #"." ;

<NIL> : 'nil';

<BOOLEAN> : 'true' | 'false' ;

<SYMBOL>: '.' | '/' | NAME ;

<NS_SYMBOL>: NAME '/' SYMBOL ;

<PARAM_NAME>: '%' (#"[1-9][0-9]*"|'&')? ;

<NAME>: SYMBOL_HEAD SYMBOL_REST* (':' SYMBOL_REST+)* ;

<SYMBOL_HEAD>: #"[^0-9\^`\\\"#~@:/%\()\[\]{} \n\r\t,]" ;

<SYMBOL_REST>: SYMBOL_HEAD | #"[0-9]" | '.' ;

<COMMENT>: ';' #"[^\r\n]*" ;

It's not quite right, but I'm going to come back to it later.
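
One detail worth checking when porting the lexer rules: ANTLR writes character ranges with two dots, as in '0'..'9', while a regex character class needs a hyphen. Inside a character class the dots are literal characters, so a class copied over with dots silently matches the wrong set. The Python sketch below illustrates the difference and shows one way the rewrite might be automated. `g4_range_to_regex` is a hypothetical helper name, and Python's `re` is used only for illustration; Instaparse on the JVM uses Java regexes, which treat this character class the same way.

```python
import re


def g4_range_to_regex(expr: str) -> str:
    """Rewrite ANTLR ranges such as 'a'..'z' as regex classes such as [a-z].

    Hypothetical helper: handles only single-character endpoints written
    as quoted literals, which is an assumption about the input grammar.
    """
    def to_class(match: re.Match) -> str:
        low, high = match.group(1), match.group(2)
        # Escape the endpoints in case they are regex metacharacters.
        return "[" + re.escape(low) + "-" + re.escape(high) + "]"

    return re.sub(r"'(.)'\s*\.\.\s*'(.)'", to_class, expr)


# Inside a character class, dots are literal characters, not a range operator.
dotted = re.compile(r"[0..9]")  # matches only '0', '.', or '9'
hyphen = re.compile(r"[0-9]")   # matches any decimal digit

print(g4_range_to_regex("'0' .. '9'"))
print(dotted.fullmatch("5"), hyphen.fullmatch("5"))
```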