Perl-Apollo/Corinna

Consider Unicode Identifiers

wollmers opened this issue · 5 comments

Not that I myself would ever use it, but specifying the allowed characters for identifiers differently from Perl would confuse users.

This is the definition for identifiers in https://perldoc.perl.org/perldata:

/ (?[ ( \p{Word} & \p{XID_Start} ) + [_] ])
  (?[ ( \p{Word} & \p{XID_Continue} ) ]) *    /x

IMHO a BNF extended with a notation conforming to Unicode is still well specified.

Same for

METHODNAME       ::= [a-zA-Z_]\w*

which has \w as its continuation characters. Under use utf8, \w is the same as \p{Word}, which is not the same as [a-zA-Z0-9_].
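A minimal sketch of that difference, assuming a Perl recent enough for Unicode-aware \w on character strings. The helper names and the sample names are hypothetical; the non-ASCII letter is written as an escape so the snippet does not depend on the source file's encoding.

```perl
use strict;
use warnings;

# The spec's METHODNAME pattern, anchored to the whole string.
sub matches_spec { return $_[0] =~ /\A[a-zA-Z_]\w*\z/ ? 1 : 0 }

# The narrower ASCII-only reading of the same rule.
sub matches_ascii { return $_[0] =~ /\A[a-zA-Z_][a-zA-Z0-9_]*\z/ ? 1 : 0 }

# Hypothetical names; "\x{3b1}" is GREEK SMALL LETTER ALPHA.
my @names = ('total', "sum_\x{3b1}");

binmode STDOUT, ':encoding(UTF-8)';
for my $name (@names) {
    printf "%-8s spec=%d ascii=%d\n",
        $name, matches_spec($name), matches_ascii($name);
}
```

Both patterns accept total, but only the \w version accepts the name ending in alpha, so the two readings of the spec already disagree on non-ASCII continuation characters.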

Ovid commented

I would love to see something like this, but I suspect the scope would be far outside of Corinna and would likely complicate parsing. I also strongly suspect that P5P would reject something like this. For now, this is outside the scope of V1. Sorry.

haarg commented

While the spec says things like [a-zA-Z_]\w*, I expect the implementation would follow the standard rules for Perl identifiers, which do allow Unicode.

@haarg

While the spec says things like [a-zA-Z_]\w*, I expect the implementation would follow the standard rules for Perl identifiers, which do allow Unicode.

That's what I also expected, that P5P will not define an extra parser for Cor. Since 5.18 under use utf8 it's defined as follows (see https://perldoc.perl.org/perldata#Identifier-parsing):

/ (?[ ( \p{Word} & \p{XID_Start} ) + [_] ])
  (?[ ( \p{Word} & \p{XID_Continue} ) ]) *    /x

This uses Unicode properties made exactly for identifiers, where XID_Start also contains letters outside ASCII or Latin, and '_' is added.
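As a rough illustration, the perldata pattern can be anchored and tried against a few names. This is only a sketch: the is_identifier helper and the sample names are hypothetical, and it checks the regex on plain strings rather than how the Perl parser itself tokenizes source code.

```perl
use 5.018;      # (?[ ... ]) needs Perl 5.18 or later
use warnings;
no warnings 'experimental::regex_sets';   # silences the warning on older Perls

# The perldata identifier pattern, anchored to the whole string.
my $identifier = qr/ \A
    (?[ ( \p{Word} & \p{XID_Start} ) + [_] ])
    (?[ ( \p{Word} & \p{XID_Continue} ) ]) *
\z /x;

sub is_identifier { return $_[0] =~ $identifier ? 1 : 0 }

# Hypothetical names; "\x{3b1}" is GREEK SMALL LETTER ALPHA.
my @names = ('_private', "\x{3b1}lpha", '1abc', 'foo-bar');

binmode STDOUT, ':encoding(UTF-8)';
printf "%-10s %s\n", $_, is_identifier($_) ? 'ok' : 'rejected' for @names;
```

The first two names are accepted, while the leading digit and the hyphen are rejected, because neither a digit nor '-' is in XID_Start and '-' is not in XID_Continue.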

\p{Word}, which is the same as \w, is defined in https://unicode.org/reports/tr18/#Default_Word_Boundaries as the union of:

\p{alpha}
\p{gc=Mark}
\p{digit}
\p{gc=Connector_Punctuation}
\p{Join_Control}
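A small sketch comparing \p{Word} with the union of those five properties on a handful of characters. The Perl spellings used here (\p{Alphabetic} for \p{alpha}, \p{gc=Decimal_Number} for \p{digit}) are my assumption about how the TR18 names map onto Perl property names, and the helper names are hypothetical.

```perl
use 5.018;
use warnings;

# Union of the five listed properties, using assumed Perl spellings.
my $union = qr/\A[\p{Alphabetic}\p{gc=Mark}\p{gc=Decimal_Number}\p{gc=Connector_Punctuation}\p{Join_Control}]\z/;
my $word  = qr/\A\p{Word}\z/;

sub in_union { return $_[0] =~ $union ? 1 : 0 }
sub in_word  { return $_[0] =~ $word  ? 1 : 0 }

# Sample characters: letter, digit, connector punctuation,
# combining mark, zero width joiner, hyphen, space.
my @samples = ('a', '7', '_', "\x{301}", "\x{200d}", '-', ' ');

for my $ch (@samples) {
    printf "U+%04X word=%d union=%d\n", ord($ch), in_word($ch), in_union($ch);
}
```

On these samples the two columns agree: the first five characters are word characters under both definitions, and the hyphen and the space are rejected by both.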

That's what I also expected, that P5P will not define an extra parser for Cor.

Indeed so. I was fully intending to just continue to use core bits and pieces for as much of this as possible, for consistency, rather than rebuild entire new things from scratch. I'm viewing the spec very much as a hand-wavy suggestion in this sense.