orangeduck/mpc

parser for """long string""" ?

carueda opened this issue · 2 comments

Question: how do I define a parser for arbitrary contents delimited by multi-character markers? As a concrete case, consider Python long strings, e.g.:

"""
some
contents
"""

My case is actually with {{{ and }}} as delimiters, but any hints are appreciated.

I would assume you'd have to make it the first rule in your grammar so that it always takes precedence, and then it would be, more or less, {{{ .+ }}}

Thanks for the hint.

Not as straightforward as I thought. Here's what I've tried:

mpc_re_mode("{{{.+}}}", MPC_RE_MULTILINE | MPC_RE_DOTALL);

The .+ greedily consumes everything, including the closing delimiter, so this is not a solution.

mpc_re_mode("{{{[^}]+}}}", MPC_RE_MULTILINE | MPC_RE_DOTALL);

This one works better, but it fails for contents that include }, which must be allowed, e.g.:

{{{
  foo {
    bar ...
  }
}}}

With combinators, I've tried the following block definition:

  mpc_parser_t *no3b = mpc_not(mpc_string("}}}"), free);

  mpc_parser_t *item = mpc_or(2,
                              mpc_many1(mpcf_strfold, mpc_noneof("}")),
                              mpc_and(2,
                                      mpcf_strfold,
                                      no3b,
                                      mpc_or(2, mpc_string("}}"), mpc_string("}")),
                                      free
                              ));

  mpc_parser_t *block = mpc_and(3, mpcf_strfold,
                                mpc_string("{{{"),
                                mpc_many(mpcf_strfold, item),
                                mpc_string("}}}"),
                                free, free);

I was expecting this to be a solution, but it only works when the contents contain no embedded }. When they do, intriguingly, a segmentation fault occurs. Maybe a mistake in the definition?
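One possible explanation, offered as an assumption rather than a confirmed diagnosis: mpc_not appears to yield a NULL value when it succeeds, and a string fold that calls strlen on every child would then dereference that NULL inside the mpc_and. The self-contained sketch below does not use mpc at all; fold_strings_null_safe and dup_str are hypothetical names that only illustrate a fold treating NULL children as empty strings.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: heap-allocated copy of s (strdup is not ISO C89). */
static char *dup_str(const char *s) {
  char *out = malloc(strlen(s) + 1);
  if (out != NULL) { strcpy(out, s); }
  return out;
}

/* Hypothetical stand-in for a string fold; this is NOT mpc's implementation.
   Concatenates n heap-allocated strings into one new string and frees the
   inputs. NULL entries (e.g. the assumed result of a negative lookahead)
   are skipped, because strlen(NULL) is undefined behaviour. */
static char *fold_strings_null_safe(int n, char **xs) {
  size_t len = 0;
  char *out;
  int i;

  assert(n >= 0);

  /* First pass: total length of all non-NULL children. */
  for (i = 0; i < n; i++) {
    if (xs[i] != NULL) { len += strlen(xs[i]); }
  }

  out = calloc(len + 1, 1);  /* zero-filled, so it starts as "" */
  if (out == NULL) { return NULL; }

  /* Second pass: append each non-NULL child and release it. */
  for (i = 0; i < n; i++) {
    if (xs[i] != NULL) { strcat(out, xs[i]); free(xs[i]); }
  }
  return out;
}
```

Folding the pair { NULL, "}}" } with this function yields "}}" instead of crashing. If the assumption about mpc_not holds, another option may be to build the guard with mpc_not_lift(mpc_string("}}}"), free, mpcf_ctor_str), both of which are declared in mpc.h, so that the lookahead contributes an empty string rather than NULL.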

Now, the following grammar-based block definition seems to work as needed:

  mpc_parser_t *no3b = mpc_new("no3b");
  mpc_parser_t *item = mpc_new("item");
  mpc_parser_t *block = mpc_new("block");

  mpc_define(no3b, mpc_not(mpc_string("}}}"), free));

  mpc_define(item,
             mpca_grammar(MPCA_LANG_WHITESPACE_SENSITIVE,
                          " /[^}]+/  |  ( <no3b> (\"}}\" | \"}\" ) ) ",
                          no3b, NULL));

  mpc_define(block,
             mpca_grammar(MPCA_LANG_WHITESPACE_SENSITIVE,
                          "  \"{{{\" <item>* \"}}}\"  ",
                          item, NULL));

I'll do more testing with this one and eventually go with it.

AFAICT, this grammar-based definition should be basically equivalent to the combinator-only one above.

To summarize the exercise (and happy to enter other tickets as convenient):

  • why the segfault mentioned above? (the parser definition seems correct to me)
  • possible additional MPC features (not really sure how difficult to implement):
    • an mpc_until combinator that accepts any input until the given parser matches.
    • something like " !<parser> ..." to expose the mpc_not combinator at the lang/grammar level.
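
To make the intended "anything until the closing delimiter" behaviour concrete, here is a minimal sketch in plain C that does not use mpc. The function extract_delimited is a hypothetical name, not an existing or proposed mpc API; it only shows the semantics an mpc_until-style combinator would need for the {{{ ... }}} case, assuming the first }}} after the opener always ends the block.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of mpc. Returns a heap-allocated copy of the
   text between the first occurrence of `open` in `input` and the next
   occurrence of `close` after it, or NULL if either delimiter is missing
   or allocation fails. The caller frees the result. */
static char *extract_delimited(const char *input,
                               const char *open,
                               const char *close) {
  const char *start;
  const char *end;
  size_t len;
  char *out;

  assert(input != NULL && open != NULL && close != NULL);

  start = strstr(input, open);
  if (start == NULL) { return NULL; }
  start += strlen(open);           /* skip past the opening delimiter */

  end = strstr(start, close);      /* first closing delimiter wins */
  if (end == NULL) { return NULL; }

  len = (size_t)(end - start);
  out = malloc(len + 1);
  if (out == NULL) { return NULL; }
  memcpy(out, start, len);
  out[len] = '\0';
  return out;
}
```

Single and double closing braces inside the contents are kept, since only a full }}} terminates the scan. This matches what the item rule above accepts.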

Thanks.