skvadrik/re2c

Overloading precedence marker (e.g. (pattern)) with 'posix' group markers (also parenthesis) causes loss of full functionality

Closed this issue · 6 comments

The overloading of the posix group boundaries markers (e.g. (group) ) and the normal precedence group (e.g. also parenthesis) make all but trivial patterns impractical/impossible to define. (i.e. most non-trivial patterns contain the normal precedence markers, thus seem to be incompatible with the posix group markers. And the trivial cases are forced to be known by their position rather than the pattern type they matched.)

Is there (or could there be) a way to ‘override/redefine’ the markers (e.g. () ) that posix group uses, thus freeing up the use of the normal precedence markers (e.g. () ) along with the posix submatch groups?

Better yet, the optimal solution would look something like @x(pattern) … and when that pattern is matched, a callback (macro) sends back the start, end, a whatever X was given… This would overcome the limitation of stag, by allowing repetition (like mtag), and additionally, would improve the efficiency over the mtag method (e.g. not having function calls ‘during’ the pattern match phase, like mtag, but only at the end (or when accept of sub patterns) like posix (and stag)… and would overcome mtag not allowing the same tag being used more than once in the solution… In summary, it finds the boundaries, then communicates not only ‘where’ they are but ‘what’ they bound.

If there is ‘any’ way that the current system is capable of doing this, please let me know. If not, what is the feasibility of enhancing the capability to handle this use case efficiently and simply. If you have any question on what I mean, please let me know, and I will try to clarify.

Thanks

Hi @dtp555-1212, there are two subproblems here:

  • adding named capturing groups that would allow ordinary parenthesized group to be non-capturing
  • the "callback" idea that would allow repeated capturing groups to be matched more efficiently

The two subproblems are orthogonal, so let's consider each of them separately.

Named capturing groups are a good idea, and not very difficult to implement. They can be used both with POSIX longest-match semantics, and with Perl leftmost-greedy semantics. The main question here is syntax. I like your suggestion @name(...), but before deciding on something it is a good idea to consider existing syntax like (<name> ... ), and also it should not introduce any syntactic ambiguity. There also must be a way to refer to the opening and closing parentheses of each named group, so instead of a tag name must be a pair of tags (a struct perhaps).

The "callback" idea is problematic. Maybe I don't fully understand what you suggest, but I don't think it can improve m-tags efficiency. Saving repeated matches requires storing submatch values on all repetitions along the way, and that is what YYMTAGP and YYMTAGN are for. They don't have to be implemented as functions --- instead, they can be C macros or free-form pieces of code that are inlined by re2c. But they have to store the necessary information during matching. Using trie-like data structure allows to implement that efficiently.

Thanks for your quick reply, and being open to the concepts. I totally agree with your wanting to nail down a consistent, unambiguous syntax. I am open to anything that would solve the use case.

A named ‘closing’ parenthesis would certainly disambiguate it, but it could also be that be a rule that the parenthesis must be perfectly nested, which would automatically disambiguate the matching parenthesis. (That would be sufficient for ‘my’ use case, but I guess you ‘could’ considering allowing ‘overlapping’ regions, but that seems like it may cause a lot of complications.

As for the mtag inefficiencies, I alluded to … from what I see, that macro is instantiated ‘a lot’. Since per the mtag example code, that turns into a function call, that could add a lot of bulk and overhead to the code, for large grammars. I ‘think’ since a callback (a macro) would only get inserted for ‘accepts’ I suspect that would be very infrequent compared to what I see in some grammars that use the mtag protocol. Also, the current mtag protocol doesn’t seem to have a way to communicate the ‘what’, only the 'where' (and even the where seems to count on the ‘many’ instances of the YYMYAGP & YYMYAGN macros, as each call only is a start, end, null, not the entire span in one shot). As envisioned, only at the ‘accept’ of the pattern would the callback (a macro) need to be called with the start, end, and the ‘name’ of the pattern. I guess it ‘could’ be called repeatedly for the ‘minimal’ through all potential matches, if there were multiple ambiguous solutions, and then the application could decide what to such as extend the range, branch, etc… (because it would have all the data to do so). However, with lookahead, I suspect most well formed grammars are going to resolve to ideally a single solution (or very few solutions).

In the end, being able to achieve this use case ‘anyway’ is better than no way. I love the re2c program. Because the performance has been improved over time, I am trying more and more complex grammars. Thanks for all your hard work! I will obviously make myself available as a resource for reviewing/testing any proposal you suggest for a solution.

Thanks again

P.S. Here was an idea for a ‘failed attempt’ at a workaround that may spark an idea… I tried to get the addresses of the tag, so I could refer to the address when the mtag was reported. The suffered from two issues… There was not a 1-to-1 correspondence for the ‘t’ being sent via the macro. And, the internal yytxx variable associated with the tag was changing whenever there was a change to the grammar.

Another observation is that the posix form looks very efficient internally (no function calls triggered by macro expansion…) and knows all the sub matches at the conclusion of the function. That looks like the ideal place to be able to instantiate a macro that would convey back the start, end and ‘name’ of the spans, rather than insert positionally into the yypmatch array.

Hope that helps. (But ignore, if you have another straightforward plan in mind)

Unfortunately, submatch extraction is a more complex problem than it might seem.

I ‘think’ since a callback (a macro) would only get inserted for ‘accepts’ I suspect that would be very infrequent compared to what I see in some grammars that use the mtag protocol.

This assumption is wrong. It is not possible to wait until the lexer is in accepting state and then save all m-tags, because by that time the necessary information is not available anymore. When a lexer is in accepting state, how is it to know what parts of the matched input correspond to different tags? Right, it needs to save tags along the way. For s-tags this means saving a pointer, and for m-tags it means saving a list of pointers.

The suffered from two issues… There was not a 1-to-1 correspondence for the ‘t’ being sent via the macro. And, the internal yytxx variable associated with the tag was changing whenever there was a change to the grammar.

And this is another misunderstanding. There is no 1-to-1 corresponding between tags and tag variables for a reason.

First, it may be impossible to implement one tag (s-tag or m-tag) with just one variable. Under the hood, every path in DFA is a bunch of parallel NFA paths (recall that a DFA state is a set of NFA states). Each NFA path may have its own tagged transitions for the same tag, and as a result one tag may have different values on different NFA paths that all go to the same DFA state. At the end, only one of the parallel NFA paths comes to an accepting state and "wins", and the final tag value is its value on that winning NFA path. But it is unknown which path is going to win until the end of match, therefore it is necessary to save all possible tag values. That's how one tag ends up having multiple versions in one DFA state. In re2c terminology, the maximum number of different versions of the same tag in a DFA state is called the tag's "degree of nondeterminism", and there is a warning -Wnondeterministic-tags which warns about tags with degree 2 or more.

Second, one variable may be enough to implement multiple tags. This becomes quite obvious if you think of cases like a* @x @y b*, where tags x and y are both in the same place. But there are less obvious cases, where different tags do not interfere with each other, and can both use the same tag variable. In truth, the relation between tags and tag variables is similar to the relation between C-level variables and CPU registers (and re2c allocates tag variables with an algorithm similar to register allocation).

That's why tag variables change so much when you change the grammar, and re2c tries to hide these internal details.

If you are interested in the gory details of submatch extraction, there are papers about it:

Thanks for the insights... To be clear, there may have been some been some comingling of ideas between the posix and mtag paradigm. I think I understand why the mtag paradigm does what is does, and it makes sense why it is done the way it is done. However, in the 'posix' paradigm, all the examples of the code generated when using that (so far), 'appears' to make the assignment to the yypmatch array at the conclusion of the routine. (Maybe I just haven't seen a counter example.) As such, at the conclusion of the routine, it knows the final resolved start and end for each submatch. If that is true, then sending the answers back via a callback macro rather than inserting them into yypmatch, 'seems' like a plausible alternative, but as mentioned, any solution is fine. If those observations don't help to lead to one, that is ok. I am not trying to enforce how you do it. I was just trying to help by sparking some additional ideas. Hope that helps.

Thanks again for your thought and consideration of this use case.

I added syntax (? ...) for non-capturing groups: 1edd25d.