tc39/proposal-regexp-v-flag

symmetric difference

markusicu opened this issue · 7 comments

Should we add an operator for symmetric difference? The Unicode regex spec (UTS #18) suggests it, and a couple of implementations (Python regex module, Perl experimental) support it. It is a standard set operation. However, it is unclear whether there is a practical use case for it.

If we don’t add it now, we may not be able to add it later without compatibility issues/flags/new syntax.

It is less important than the other operators, but it is still useful for expressing the difference between two sets without resorting to fairly ugly expressions such as:

\p{Lowercase}\p{Lowercase_Letter}--[\p{Lowercase}&&\p{Lowercase_Letter}]
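As a rough illustration of what the expression above computes, here is a minimal Python sketch. It rests on assumptions that are not part of this thread: `str.islower()` on a single character stands in for `\p{Lowercase}`, and the general category `Ll` from `unicodedata` stands in for `\p{Lowercase_Letter}`. The exact members depend on the Unicode version bundled with the Python build.

```python
import sys
import unicodedata

# Assumption: str.islower() on one character approximates \p{Lowercase},
# and general category "Ll" approximates \p{Lowercase_Letter}.
code_points = [chr(cp) for cp in range(sys.maxunicode + 1)]
lowercase = {ch for ch in code_points if ch.islower()}
lowercase_letter = {ch for ch in code_points if unicodedata.category(ch) == "Ll"}

# The "ugly" form from the thread: union minus intersection.
union_minus_intersection = (lowercase | lowercase_letter) - (lowercase & lowercase_letter)

# The same set written as the union of the two one-sided differences.
two_differences = (lowercase - lowercase_letter) | (lowercase_letter - lowercase)

# Python's built-in symmetric difference operator on sets.
symmetric_difference = lowercase ^ lowercase_letter
```

All three names should hold the same set, which is the identity a dedicated symmetric difference operator would abbreviate.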

One option would be to reserve ~~ (and maybe a few other doubled ASCII symbols as well) for future extension. That is, in the first release they would be syntax errors.
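To make "reserved, so a syntax error for now" concrete, here is a small hypothetical sketch in Python. The function name `check_class_body` and the one-entry reserved list are made up for illustration; they are not proposed spec text or any engine's actual implementation. The sketch also assumes that a backslash escapes exactly one following character.

```python
# Hypothetical reserved list; the thread only names ~~ explicitly.
RESERVED_DOUBLE_PUNCTUATORS = ("~~",)


def check_class_body(body: str) -> None:
    """Raise SyntaxError if an unescaped reserved doubled punctuator appears."""
    i = 0
    while i < len(body):
        if body[i] == "\\":
            # Skip the backslash and the character it escapes.
            i += 2
            continue
        for punctuator in RESERVED_DOUBLE_PUNCTUATORS:
            if body.startswith(punctuator, i):
                raise SyntaxError(
                    f"{punctuator!r} at index {i} is reserved for future use"
                )
        i += 1
```

With this check, `check_class_body(r"\p{Lowercase}~~\p{Lowercase_Letter}")` raises, while the same body written with `--` or with an escaped `\~` passes. Rejecting the sequence up front is what would let a later proposal give it meaning without changing the behavior of existing patterns.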

For inspecting differences between sets, we usually use A-B and B-A, separately. I don't know what to do with the xor.

But there is the opportunity. Now or never... and reserving the syntax is not much less work than actually specifying and implementing it.

I like the idea of reserving further ASCII punctuation/symbols, doubled or not.

sffc commented

\p{Lowercase}\p{Lowercase_Letter}--[\p{Lowercase}&&\p{Lowercase_Letter}]

Wouldn't this be

\UnicodeSet{\p{Lowercase}\p{Lowercase_Letter}--\UnicodeSet{\p{Lowercase}&&\p{Lowercase_Letter}}}

since the premise of this proposal is to not change the meaning of []?

Shane: Separate question. This issue is about whether to support symmetric difference directly. I am sure Mark wanted to show the set expression without focusing on the actual proposed ECMAScript syntax. His example would work with what UTS #18 suggests.


We didn't say that we wouldn't use [] at all any more. Inside of the new syntax it would be totally natural, so it might(!) look like \UnicodeSet{\p{Lowercase}\p{Lowercase_Letter}--[\p{Lowercase}&&\p{Lowercase_Letter}]}

sffc commented

ok, sorry to derail the thread

I'd prefer to keep this initial proposal as minimal as possible, i.e. exclude symmetric difference. We can — and should — however reserve syntax to ensure we can extend set notation in the future.

reserving the syntax is not much less work than actually specifying and implementing it.

This is by itself not an argument to add a new language feature. We need to make such decisions based on real-world use cases. Since we already have plenty of figuring out to do even without considering symmetric difference, I'd prefer postponing those decisions to a follow-up proposal.

In PR #13 I added a note about symmetric difference to the proposal FAQ:
https://github.com/tc39/proposal-regexp-set-notation#what-about-symmetric-difference