[Proposal] more permissive definition of allowed symbol syntax
terefang opened this issue · 43 comments
currently the definition is as in https://htmlpreview.github.io/?https://github.com/jgm/djot/blob/master/doc/syntax.html#symbols
which says:
Surrounding a word with : signs creates a “symbol,” which by default is just rendered literally ...
so the implementation is highly parser- and renderer-specific.
- in the Haskell implementation at https://github.com/jgm/djoths/blob/f1ab25c2205f46ab4923ed2a66ed91b4971eda04/src/Djot/Djot.hs#L400C11-L400C65 a word is defined as a literal.
- but here https://github.com/jgm/djoths/blob/f1ab25c2205f46ab4923ed2a66ed91b4971eda04/src/Djot/Inlines.hs#L214 it is:
```haskell
pSymbol = do
  asciiChar ':'
  bs <- byteStringOf $ skipSome (skipSatisfyByte
           (\c -> c == '+' || c == '-' ||
                  (isAscii c && isAlphaNum c)))
  asciiChar ':'
```
- in the Lua implementation at https://github.com/jgm/djot.lua/blob/a0583ef8270d025b3e86ef521b35f7397cf7215b/djot/inline.lua#L403 a word is defined as

```lua
-- 58 = :
[58] = function(self, pos, endpos)
  local sp, ep = bounded_find(self.subject, "^%:[%w_+-]+%:", pos, endpos)
  if sp then
    self:add_match(sp, ep, "symbol")
    return ep + 1
  else
    self:add_match(pos, pos, "str")
    return pos + 1
  end
end,
```
so it seems, at least comparing the Haskell and Lua implementations, that there is some disagreement.
Proposal
- allow the syntax of html entities/symbol/emojies -- https://www.w3schools.com/html/html_entities.asp
- allow the syntax of adobe postscript-style glyph names -- https://github.com/adobe-type-tools/agl-specification
- allow the syntax of opentype font glyph names -- https://silnrsi.github.io/FDBP/en-US/Glyph_Naming.html
- allow common naming practices of common css icon-libraries such as Font Awesome Icons, Google Material Design Icons, or Bootstrap Icons (i.e. `[a-zA-Z0-9_-]+`).
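As a quick check, the icon-library convention in the last bullet can be expressed as a regex (a sketch; the class names below are illustrative examples from those libraries, not tied to any djot implementation):

```javascript
// Common CSS icon-library class names (Font Awesome, Material Design
// Icons, Bootstrap Icons) all fit the `[a-zA-Z0-9_-]+` pattern.
const iconNameRe = /^[a-zA-Z0-9_-]+$/;

for (const name of ["fa-bars", "mdi-account", "bi-alarm"]) {
  console.log(name, iconNameRe.test(name)); // all true
}
```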
Use-Cases
XML/HTML Renderer
HTML Entity

- `:apos:` to be rendered as `&apos;`
- `:Euro:` to be rendered as `&Euro;`

HTML Entity Code Point

- `:#60:` to be rendered as `&#60;` or `<`
- `:#x2014:` to be rendered as `&#x2014;` or `&mdash;` or `—` or "—"
Icon Font Glyph Name

- `:fa-bars:` to be rendered as `<i class="fa fa-bars"></i>`
- `:fa+fa-bars:` to be rendered as `<i class="fa fa-bars"></i>`

This may be subject to the actual implementation and/or configuration of the html-renderer backend.
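A sketch of what such an HTML renderer backend might do (the `symbolTable` and `renderSymbol` names and entries here are hypothetical, not part of any djot implementation):

```javascript
// Hypothetical symbol table for an HTML renderer backend.
const symbolTable = {
  "apos":    "&apos;",
  "Euro":    "&Euro;",
  "fa-bars": '<i class="fa fa-bars"></i>',
};

function renderSymbol(name) {
  if (name in symbolTable) return symbolTable[name];
  const m = /^#(x[0-9a-fA-F]+|[0-9]+)$/.exec(name);
  if (m) return "&#" + m[1] + ";"; // pass numeric code points through as entities
  return ":" + name + ":";         // fall back to the literal symbol
}

console.log(renderSymbol("fa-bars")); // <i class="fa fa-bars"></i>
console.log(renderSymbol("#x2014")); // &#x2014;
```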
Icon or Symbol Font Glyph Name for PDF, Image, or Unicode Text Renderer

- `:a19:` to be rendered as "✓" (from the Zapf Dingbats font) -- (U+2713 CHECK MARK ✓, `&#x2713;`)
Possible Variation

it might be desirable to clearly separate the verbatim html entities from glyph names by using a prefix for indication.

- `:*a19:` to be rendered as "✓" (from the Zapf Dingbats font) -- (U+2713 CHECK MARK ✓, `&#x2713;`)

Possible candidates would be: `^`, `!`, `&`, `$`, `%`, `/`, `=`, `?`, `+`, `~`, `*`, `#`
So a possible syntax could be, as a Perl-style regular expression:

`/^[\^\!\&\$\%\/\=\?\+\~\*\#]?[[:alnum:]\.\_\-\+]+$/`
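For implementations without POSIX character classes, the same pattern could be approximated like this (a sketch, with `[[:alnum:]]` narrowed to ASCII, matching djot's current ASCII-only stance):

```javascript
// JS approximation of /^[\^\!\&\$\%\/\=\?\+\~\*\#]?[[:alnum:]\.\_\-\+]+$/
// (optional sigil prefix, then one or more ASCII word/punctuation chars).
const symbolRe = /^[\^!&$%\/=?+~*#]?[A-Za-z0-9._+-]+$/;

console.log(symbolRe.test("*a19"));    // true  (prefixed glyph name)
console.log(symbolRe.test("fa-bars")); // true
console.log(symbolRe.test("a b"));     // false (no whitespace allowed)
```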
Graceful Fallback Mechanism

where a renderer backend might not be able to recognize which symbol or glyph to actually render, it might fall back in the following ways:
pure text style

- `:apos:` to be rendered as `:apos:`
- `:Euro:` to be rendered as `:Euro:`
- `:#60:` to be rendered as `:#60:`
- `:#x2014:` to be rendered as `:#x2014:`
- `:fa-bars:` to be rendered as `:fa-bars:`
- `:fa+fa-bars:` to be rendered as `:fa+fa-bars:`
- `:a19:` to be rendered as `:a19:`
- `:*a19:` to be rendered as `:*a19:`
html style

- `:fa-bars:` to be rendered as `<code>:fa-bars:</code>`
- `:fa+fa-bars:` to be rendered as `<code>:fa+fa-bars:</code>`
- `:a19:` to be rendered as `<code>:a19:</code>`
- `:*a19:` to be rendered as `<code>:*a19:</code>`
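The two fallback styles above amount to something as simple as the following (illustrative function names, not from any djot implementation):

```javascript
// Sketch of the graceful fallback: when a backend has no mapping for a
// symbol, emit it verbatim (text backends) or wrapped in <code> (HTML).
function fallbackText(name) { return ":" + name + ":"; }
function fallbackHtml(name) { return "<code>:" + name + ":</code>"; }

console.log(fallbackText("fa-bars")); // :fa-bars:
console.log(fallbackHtml("*a19"));    // <code>:*a19:</code>
```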
Entities that should always be recognized

- Entities specified as numeric, eg. `&nbsp;`, `&#160;`, etc. -- https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
There's not supposed to be a difference in parsing, but the way symbols are rendered is an implementation detail.
The line you point to in the Haskell code to establish a difference in parsing has nothing to do with parsing, that's part of the djot renderer.
since I know nothing about Haskell, it was a guess, to be honest -- but do I have a valid proposal, apart from the code?
Actually, you're right that the Haskell implementation doesn't allow `_`, while the Lua and JavaScript implementations do. That's something I'll need to fix in djoths.
I do want to keep symbols "abstract," in the sense that no particular rendering is mandated for them (they can be treated in different ways by the renderer). But your main suggestion is to broaden the characters allowed in symbols, to make it easier to use them for things like `:*a19:`. At the least, you'd want to allow `&`, `#`, and `*`, in addition to the things that are already allowed (`-`, `_`, `+`, ASCII alphanumerics).
I'm open to that, I suppose. Comments from others welcome.
At the least, you'd want to allow `&`, `#`, and `*`, in addition to the things that are already allowed (`-`, `_`, `+`, ASCII alphanumerics). I'm open to that, I suppose. Comments from others welcome.
yes, that is what I would suggest, and having feedback from other sources, both pro and con, is appreciated.
I do want to keep symbols "abstract," in the sense that no particular rendering is mandated for them (they can be treated in different ways by the renderer).
with pandoc in mind, I wouldn't mandate a particular rendering either.
so the things above should be taken as examples or possible use-cases rather than strict rules.
@jgm I wrote a JS filter which converts symbols looking like a decimal `:331:` or hexadecimal `:0x14b:` number into the Unicode character with the corresponding codepoint ‹ŋ›. I also wrote a filter converting [gemoji][] "names" to emoji. There are one or two conflicts like `:100:` but that resolves itself if you run the emoji filter first and use `:0x64:` in the unlikely case that you would want to represent ‹d› that way.
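The core of such a filter might look like this (a sketch; the real filter is not shown in this thread, and `symbolToChar` is a hypothetical name):

```javascript
// Convert a symbol payload that looks like a decimal ("331") or
// hexadecimal ("0x14b") number into the corresponding Unicode character.
function symbolToChar(name) {
  if (!/^(0x[0-9a-fA-F]+|[0-9]+)$/.test(name)) return null;
  return String.fromCodePoint(Number(name)); // Number() understands "0x..." too
}

console.log(symbolToChar("331"));    // ŋ (U+014B)
console.log(symbolToChar("0x14b"));  // ŋ
console.log(symbolToChar("smiley")); // null -- leave for other filters
```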
so you actually like my proposal? because then the `:100:` emoji and the ‹d› would be different.
@terefang I like the idea but not the "syntax": we already have the colons, so the full {HT,X}ML entity syntax is overkill, and the `#`/`#x` stuff is a PITA when the rest of the world (even JS) uses `0x`, and regex can match a string of ASCII digits: for example `:&0x64:`/`:&100:`, `:&0x4d2:`/`:&1234:` for codepoints and/or `:@100:`/`:@1234:` for emoji is quite enough.¹
I say allow any non-blank ASCII char other than `:` itself, at least as a prefix, and it should be a huge improvement. Perhaps also medial `.` so that one can use syntaxes like `:e.100:`/`:u.100:`, though IMO `:E+100:` would look quite nice and already works!
Footnotes

1. Actually my filter already supports `:U+014B:`, but case-insensitive and not requiring more than one digit. It simply does `val = val.replace( /^U\+([0-9a-f]+)$/i, '0x$1');` before doing `var cp = Number(val);`. Easy peasy! ↩
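Put together, the footnote's two quoted lines amount to the following (a sketch wrapping the quoted snippet; `uPlusToChar` is a hypothetical name):

```javascript
// Normalize ":U+014B:"-style payloads to "0x..." and convert to a char,
// as described in the footnote above.
function uPlusToChar(val) {
  val = val.replace(/^U\+([0-9a-f]+)$/i, "0x$1");
  var cp = Number(val);
  return Number.isNaN(cp) ? null : String.fromCodePoint(cp);
}

console.log(uPlusToChar("U+014B")); // ŋ
console.log(uPlusToChar("u+2713")); // ✓
```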
hmm ... so it would all be written:

`:apos:` `:Euro:` `:60:` `:0x2014:` `:fa-bars:` `:fa+fa-bars:` `:a19:`

- with `:100:` being ambiguous?
With my current filters `:100:` is indeed ambiguous: you need to run the emoji filter before the chars filter and use `:0x64:` for the char. My idea was to use `:&100:`/`:&0x64:`/`:u.100:`/`:u.0x64:` for the char and `:@100:`/`:e.100:`, or `:E+100:` by analogy with `:U+64:`, for the emoji.
The UI, my keyboard, or my brain (probably the last, since the colons aren't included in the string the filter deals with) omitted some colons in my previous post. Should hopefully be fixed now.

FWIW `:100:` with my current filters is not a huge problem, because you can run the emoji filter before the chars filter, and nobody in their right mind would use a symbol for ‹d›! OTOH `:1234:` might be a real problem, because someone might actually want to use a symbol for that char. Anyway I have added the `:E+1234:` syntax to my emoji filter, so now I can disambiguate! 🙂
@bpj your JS might actually benefit from my proposal the most, because of those icon libraries, isn't it so?
that one can use syntaxes like `:u.100:`

Not sure I get it all -- but in my current implementation (of a renderer), I use `:U+xxxx:` to specify Unicode characters as symbols.
@terefang I'm mostly with @jgm: what to do with symbols should be up to filters/renderers but some more permitted characters between the colons to allow a bit of "Hungarian typing" among symbols would be welcome, or even some kind of actual namespaces for symbols, although that is probably overkill.
Symbols were redefined from "emoji" to avoid hardcoding a table of alias-to-emoji mappings in the parser, which was a good thing to do. Not everyone uses emoji, or any of the other things you list, and the data structures needed for each of them are big, albeit not enormously big (although I have a script which generates a table mapping all non-Han Unicode names and aliases to chars, and that is enormous 😁 even though only a fraction of all possible Unicode chars are assigned yet!), so it is good design to let filters/renderers decide what to use symbols for, so that everyone can use symbols for what suits them and only that.

The problem is that some naughty users may want to use symbols for more than one thing, making the possibility of some kind of (pseudo-)namespacing desirable. The set of characters allowed in symbols was originally determined by what gemoji aliases use, so it makes sense to extend the set of permitted characters when symbols are to be used for other things as well. It is easy to use regex or substring comparison to determine whether any given symbol has the right format for a given use case, so there is no need to hardcode anything on the parser side.

The alias-to-emoji mappings are hardcoded in my filter by inlining the JSON, but that is mostly because reading files in JS isn't possible without third-party libraries. A more "serious" application than my filter can address that.
@Omikhleia Absolutely: my filter also recognises `:U+XXXX:` and my emoji filter now recognises `:E+...:`, but that is not really good as a general namespacing scheme.
By the way, since I use pseudo-footnotes as symbol definitions in my rendering layer, in my own use case the following should theoretically work:

```
[^:roll-on-the-floor-laughing:] :U+1F923:
...
Lorem ipsum :roll-on-the-floor-laughing: dolor sit amet...
```

("theoretically" = I didn't try, as I don't use emojis -- but I know it works for `:copy:`, likewise defined as `:U+00A9:`, i.e. ©) -- allowing users to define some aliases for Unicode characters, without the parser or renderer having to provide explicit support for such names. Just saying.
how would you use the glyphs/icons from the "font-awesome" font?

my suggestion would be `:fa+fa-bars:`, but a particular renderer implementation might allow only `:fa+bars:` or `:bars:` to be used.

how would you separate icon-sets used in parallel? (fontawesome, octoicons, material design)

`:fa+bars:`, `:o+bars:`, `:zd+bars:`?
@terefang Isn't `fa-bars` Unicode U+F0C9? Should Djot syntax really be concerned with having names for these font-specific mappings (i.e. including the implicit font change), or should it be something else? Personally, of course, I'd tend to say `:fa-bars:` could be aliased to `[:U+F0C9:]{custom-style="FontAwesome"}` (or use a class, if you do not like key-value attributes...) where the renderer is responsible for implementing styling using the appropriate font... But that might just be me ;)
my use case would be to have a pdf renderer; the implementation would need to infer from the symbol what is meant by the content author.

I would assume that a content author using the symbol syntax would either be familiar with html-style entities or have registered a particular icon font in the rendering backend.

to be honest, I currently have a test-case setup where an iconfont ("octoicons.svg") is registered under a prefix ("o").
so I have currently identified the following use cases:

- generic html entity in the currently selected font (eg. `:Euro:` or `:Pound:`)
- generic unicode code point in the currently selected font (eg. `:#100:` or `:#x64:`)
- a symbol from a registered symbolic font (eg. `:o+finance:`)
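A renderer could dispatch on these three cases roughly like this (a sketch; `fonts`, `resolveSymbol`, and the glyph value are hypothetical stand-ins for a registered icon font, not part of any djot implementation):

```javascript
// Hypothetical registry: the "o" prefix maps to glyphs from octoicons.svg.
const fonts = { o: { finance: "\uE001" } };

function resolveSymbol(name) {
  let m;
  if ((m = /^#x([0-9a-fA-F]+)$/.exec(name)))  // :#x64: hex code point
    return String.fromCodePoint(parseInt(m[1], 16));
  if ((m = /^#([0-9]+)$/.exec(name)))         // :#100: decimal code point
    return String.fromCodePoint(Number(m[1]));
  if ((m = /^([A-Za-z]+)\+(.+)$/.exec(name))) // :o+finance: registered font
    return fonts[m[1]] && fonts[m[1]][m[2]];
  return "&" + name + ";";                    // :Euro: generic entity
}

console.log(resolveSymbol("#100")); // d
console.log(resolveSymbol("Euro")); // &Euro;
```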
some don'ts:

- my case against using raw unicode-to-icon mappings is that multiple iconfonts may map different symbols to the same code point in the extension or free-usage area and create conflicts.
- also, why would the content author need to know the unicode of a symbol, if he does not need to know this to use an emoji, which is just another unicode character?
- also, why force upon the user a very unfamiliar style (eg. `[:U+F0C9:]{custom-style="FontAwesome"}`), assuming that most content authors would easily recognize a usage similar to css style (eg. `:fa+fa-bars:` or `:fa+bars:`), which is also much shorter and less error-prone?
hope that helped understanding.
Should Djot syntax really be concerned with having names for these font-specific mappings (i.e. including the implicit font change)?

Djot should not be concerned and should not have a hardcoded list (of words) to work; rather, it should allow a syntax for a particular backend renderer to do its job.
just allowing more than identifier characters in symbol names (eg. adding `&`, `$`, `%`, `?`, `+`, `~`, `*`, `#`, `.` -- or even only `&`, `+`, `*`, `#`, `.`) to the specification should be enough, and not represent a particularly bad impact on a parser implementation.
that would result in the following allowance: `[A-Z]`, `[a-z]`, `[0-9]`, `.`, `-`, `_`, `&`, `+`, `*`, `#`

but I really need to add that I like the suggestion of the `:U+[:hex:]:` usage as a shorthand/alternative to `:&#x[:hex:]:` or `:#x[:hex:]:` or `:0x[:hex:]:`.
I would like to propose the following language in the specification:

Symbols

Surrounding a word with `:` signs creates a "symbol," which by default is just rendered literally but may be treated specially by a filter. (For example, a filter could convert symbols to emojis or entities or icons. But this is not built into djot.)

My reaction is :+1: :smiley:.

To be precise, the allowed characters in a symbol's word are `[A-Z]`, `[a-z]`, `[0-9]`, `.`, `-`, `_`, `&`, `+`, `*`, and `#`.
my use case would be to have a pdf renderer (...)
Same use case for me, I'm also rendering to PDF 👍
i would assume that a content author using the symbol syntax would either be familiar with html-style entities or have registered a particular icon font in the rendering backend.
I wouldn't necessarily assume a familiarity with "HTML-style" entities -- my "writers" don't really know HTML (or to some extent only).
But the initial question goes much farther than just HTML entity name. Font awesome, octicons, material design, emojis... and what's next? Huge custom tables for all of these evolving things? And possibly, what about eventual overlaps? (see below)
... also why would the content author need to know the unicode of a symbol, if he does not need to know this to use an emoji, which is just another unicode character.
Authors don't have to know -- the aliases might be provided in another (definition) file.
.... also why force upon the user a very unfamiliar style (eg. `[:U+F0C9:]{custom-style="FontAwesome"}`), assuming that most content authors would easily recognize a usage similar to css style (eg. `:fa+fa-bars:` or `:fa+bars:`), which is also much shorter and less error-prone.

Now authors have to know CSS-like styles?

Nah :) They type `:fa-bars:` or whatever in their document. That's their "shortcut", of course. My only point is to say that symbol expansion can / might be made with aliases, possibly at a higher level, because this is what it is eventually.
TL;DR... But what is a "smiley face"?

- `fa-smile-o` = U+F118 in the Font Awesome font?
- U+263A as a Unicode character?
- U+1F60A as a Unicode emoji?
- Some SVG octicons (not even sure those map to Unicode in any way)?
- Some SVG material icon or perhaps U+E0ED?

The "authors" should therefore perhaps just type `:smiley:` (for instance) in their documents. The ultimate "composer" might alias it to something equivalent to `[:U+F0C9:]{custom-style="FontAwesome"}` or `![](octicons/smile.svg){height="0.9em"}` or `[:U+1F60A:]{custom-style="SomeNotoEmojiThing"}` or whatever they actually want in the final output. I hope this helps; I don't guarantee this to be fully sound, despite using such "hacks" of sorts (not to bother my authors with weird names or fancy `fa-` or `o-` etc. they don't really know).
that would result in the following allowance: `[A-Z]`, `[a-z]`, `[0-9]`, `.`, `-`, `_`, `&`, `+`, `*`, `#`

Somewhat unrelated, but while we are at it:
Is there a good reason why symbols are currently limited to ASCII letters, digits and a few special chars?
Why not allow `:numéro:` or `:номер:` to be valid symbols?
my use case would be to have a pdf renderer (...)
Same use case for me, I'm also rendering to PDF 👍
i would assume that a content author using the symbol syntax would either be familiar with html-style entities or have registered a particular icon font in the rendering backend.
I wouldn't necessarily assume a familiarity with "HTML-style" entities -- my "writers" don't really know HTML (or to some extent only).
but the content author would be familiar with the symbols she needs from the given documentation of the implementation.
But the initial question goes much farther than just HTML entity name. Font awesome, octicons, material design, emojis... and what's next? Huge custom tables for all of these evolving things? And possibly, what about eventual overlaps? (see below)
nobody expects you to implement any tables, unless it is for your own use-case.
a symbol from a registered symbolic font (eg. `:o+finance:`) ... also why would the content author need to know the unicode of a symbol, if he does not need to know this to use an emoji, which is just another unicode character.

Authors don't have to know -- the aliases might be provided in another (definition) file.

yes, they will get it from the documentation of the backend, or by their own doing, since it is "implementation dependent".
.... also why force upon the user a very unfamiliar style (eg. `[:U+F0C9:]{custom-style="FontAwesome"}`), assuming that most content authors would easily recognize a usage similar to css style (eg. `:fa+fa-bars:` or `:fa+bars:`), which is also much shorter and less error-prone.

Now authors have to know CSS-like styles? Nah :) They type `:fa-bars:` or whatever in their document. That's their "shortcut", of course. My only point is to say that symbol expansion can / might be made with aliases, possibly at a higher level, because this is what it is eventually.
again, they don't need to know, but a particular implementation could leverage existing prior knowledge.

why force backend implementors to do another indirection with lookup-tables (which you already disregarded above) because of a limitation on allowed characters?

it would be simpler just to allow a larger set of characters to remove that requirement.
TL;DR... But what is a "smiley face"?

- `fa-smile-o` = U+F118 in the Font Awesome font?
- U+263A as a Unicode character?
- U+1F60A as a Unicode emoji?
- Some SVG octicons (not even sure those map to Unicode in any way)?
- Some SVG material icon or perhaps U+E0ED?

The "authors" should therefore perhaps just type `:smiley:` (for instance) in their documents. The ultimate "composer" might alias it to something equivalent to `[:U+F0C9:]{custom-style="FontAwesome"}` or `![](octicons/smile.svg){height="0.9em"}` or `[:U+1F60A:]{custom-style="SomeNotoEmojiThing"}` or whatever they actually want in the final output. I hope this helps; I don't guarantee this to be fully sound, despite using such "hacks" of sorts (not to bother my authors with weird names or fancy `fa-` or `o-` etc. they don't really know).
while I like for content creators to simply use `:smiley:`, I would also like to spare them setting up lookup-tables and/or macros to get there in the first place.

that said, an implementation could leverage the data that is already present -- like reading glyph names from Truetype-/Opentype-/Type1-/SVG-fonts.
maybe I am just thinking way too far ahead in the workflow, but follow me through the following example:

- the content author registers octoicons.svg under the "o" prefix in the "hypothetical" output renderer.
- the "hypothetical" output renderer will read the svg font and extract proper glyph names for symbol usage.
- the documentation of the "hypothetical" output renderer says just to use `:iconname:` (eg. from the icon spec of octoicons); in case multiple icon-fonts are registered with conflicting names, the syntax `:prefix+iconname:` should be used.
- the content creator can now use `:news:` or `:o+news:` from the octoicons symbol set with the "hypothetical" output renderer.
that would result in the following allowance: `[A-Z]`, `[a-z]`, `[0-9]`, `.`, `-`, `_`, `&`, `+`, `*`, `#`

Somewhat unrelated, but while we are at it: Is there a good reason why symbols are currently limited to ASCII letters, digits and a few special chars? Why not allow `:numéro:` or `:номер:` to be valid symbols?
good point -- methinks that the `%w` of the lua-based reference implementation will only match ascii letters.

but for me:

- `[A-Z]` would be synonymous with the unicode character class "Uppercase Letter" -- https://www.compart.com/en/unicode/category/Lu
- `[a-z]` would be synonymous with the unicode character class "Lowercase Letter" -- https://www.compart.com/en/unicode/category/Ll
- `[0-9]` would be synonymous with the unicode character classes "Decimal Number", "Letter Number", "Other Number" -- https://www.compart.com/en/unicode/category/Nd -- https://www.compart.com/en/unicode/category/Nl -- https://www.compart.com/en/unicode/category/No

while I personally could live with the ascii limitation, I would not like to enforce it.
Requiring ASCII was a pragmatic decision -- I wanted to make it easy to implement lightweight parsers for djot. Say we allow non-ASCII alphanumerics. Well, then every djot parser needs code that (a) parses UTF-8 byte sequences to code points and (b) determines which code points are alphanumerics. This is actually a decent amount of additional complexity which we can avoid by requiring ASCII here. There isn't anywhere else where djot parsing requires determining character classes of non-ASCII characters.
You might say: well, any decent language has these built in! Not C. Not Lua.
You might say: well, any decent language has these built in! Not C. Not Lua.

I know, that is why I did not mention it specifically, and I would like to use a lua-filter within pandoc for this.

and I am with @jgm ... doing internationalization for internationalization's sake might actually be a bad decision.

pragmatic and keep-it-simple-stupid!

while Unicode is the target we should strive for, ASCII is our basis.
Requiring ASCII was a pragmatic decision -- I wanted to make it easy to implement lightweight parsers for djot. Say we allow non-ASCII alphanumerics. Well, then every djot parser needs code that (a) parses UTF-8 byte sequences to code points and (b) determines which code points are alphanumerics. This is actually a decent amount of additional complexity which we can avoid by requiring ASCII here. There isn't anywhere else where djot parsing requires determining character classes of non-ASCII characters.
hmm ... perhaps as a quick fallback, without considering correct character classes:

if you are processing symbol markup in 8-bit mode only, any byte > 127 could be treated as a word character -- that would also satisfy UTF-8.
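That byte-level treatment can be sketched as follows (hypothetical helpers, not from any djot implementation; any byte above 127 counts as a word character, so multi-byte UTF-8 sequences pass through untouched):

```javascript
// Byte-level symbol scanning: bytes > 127 are always word characters.
function isSymbolByte(b) {
  return b > 127 ||                         // any non-ASCII byte (UTF-8 lead/tail)
    (b >= 0x30 && b <= 0x39) ||             // 0-9
    (b >= 0x41 && b <= 0x5a) ||             // A-Z
    (b >= 0x61 && b <= 0x7a) ||             // a-z
    b === 0x2d || b === 0x2b || b === 0x5f; // - + _
}

// buf is a Buffer positioned at a ':'; returns the symbol text or null.
function scanSymbol(buf, pos) {
  if (buf[pos] !== 0x3a) return null;       // ':'
  let i = pos + 1;
  while (i < buf.length && isSymbolByte(buf[i])) i++;
  if (i > pos + 1 && buf[i] === 0x3a) return buf.toString("utf8", pos + 1, i);
  return null;
}

console.log(scanSymbol(Buffer.from(":numéro:"), 0)); // numéro
```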
Yes, we don't need to worry about character classes if we just want to accept any non-ASCII character as a character in a symbol. But I don't think we do; these include, for example, lots of spacing characters, accents, and all manner of things.
I would like to propose the following updated language in the specification:

Symbols

Surrounding a word with `:` signs creates a "symbol," which by default is just rendered literally but may be treated specially by a filter. (For example, a filter could convert symbols to emojis or entities or icons. But this is not built into djot.)

My reaction is :+1: :smiley:.

Notes:

- To be precise, the allowed characters in a symbol's word are `[A-Z]`, `[a-z]`, `[0-9]`, `.`, `-`, `_`, `&`, `+`, `*`, and `#`.
- By default the allowed word characters are considered ASCII (ie. 8-bit clean).
- For implementations that allow UTF-8 and implement proper unicode character classes, `[A-Z]`, `[a-z]`, and `[0-9]` would be synonymous with the unicode character classes "Uppercase Letter", "Lowercase Letter", "Decimal Number", "Letter Number", and "Other Number".
for implementations that allow UTF-8 and implement proper unicode character classes, [A-Z], [a-z], and [0-9] would be synonymous to the unicode character classes "Uppercase Letter", "Lowercase Letter", "Decimal Number", "Letter Number", and "Other Number"
Your proposal, if I understand it correctly, makes parsing implementation-dependent. I don't think that's good. There should be one answer to the question, "is this text a symbol?"
@jgm I'm totally clear and OK as to why symbols are restricted to ASCII. I just think that all printable chars (i.e. not controls and perhaps not space either) except `:` itself should be allowed, which would, among other things, allow some more "natural"-looking namespacing by convention, as this discussion has shown to be desirable.

Yesterday I added the "syntax" `:E+alias:` to my emoji filter, but honestly that "naming scheme" only makes sense in the light of `:U+XXXX:` for Unicode code points. Something like `:e(alias):` for emoji and `:c(0xHHHH):`/`:c(DDD):`/`:c(EntityName):` for chars, or `:e=alias:`/`:c=nnn:`, would look better for the general case, I think. Sigils like `:@alias:` or `:&nnn:` would look nice, but you would soon run out of usable sigil chars!

I'm thinking that maybe paired brackets `()` `{}` `[]` inside symbols might contain whitespace and properly nested balanced brackets or any non-brackets inside them?¹
Footnotes

1. FWIW I have attempted to write a JS implementation of my "sprintf on steroids" inspired by String::Formatter, currently implemented as a Perl subclass and in MoonScript/Lua using an Lpeg/re parser. When using it in Pandoc filters I take format strings from Code(Block) elements and/or metadata, but it would be nice to be able to use symbols in djot.

   It uses an extended sprintf syntax with arguments and multi-char conversion names in curly brackets:

   `%-0M,N{ARG}C %{ARG}{CONV} %{(ARG 1)(ARG 2)(ARG 3)}C %1.2{price}f %{0x14b}c %{(KEY 1)(KEY 2)(KEY 3)}s %{(353)(230)(331)}c`

   (It allows `$` as an alternative to `%`, mostly for smoother inclusion in YAML and on the Vim command line, where `%` is a reserved char.)

   Basically it's either a single arg optionally containing properly balanced nested curlies, or one or more arguments in parens, optionally separated/surrounded by whitespace and/or containing balanced nested parens and/or curlies. (Obviously the parser looks for the multi-arg case first!) You pass a format string and a table with data for lookup by key, multiple args usually being `or`ed together, using the first non-null value whose key corresponds to an argument.

   Regular sprintf/string.format conversions work out of the box, but the program using the class may pass a table mapping custom conversion names to functions, which are passed the class instance, the arguments and the data table and may do whatever they want with them, including treating arguments as nested format strings. However, during parsing arguments are not checked for syntactic validity, only for parens and/or curlies being properly balanced.

   Custom conversion names may contain any chars/bytes except `{` and `}`, and are of course allowed to override regular conversions. The only built-in custom conversion is a `c` which

   - takes the codepoints directly from the format arguments in the format string,
   - accepts `0xHH` numbers,
   - uses Lua's `utf8.char` function, so has Unicode/UTF-8 support,
   - accepts a limited number of HTML entity names for those chars which are reserved in the formatter syntax `% $ { } ( )` (and in Perl any Unicode name or HTML 5 entity name).
I'm thinking that maybe paired brackets `()` `{}` `[]` inside symbols might contain whitespace and properly nested balanced brackets or any non-brackets inside them?

talking to people bears fruit ... actually I totally disregarded the whitespace case.

although I see the use-case, that goes head over heels and way beyond the intention of a "replacement symbol".

and while it is possible to simply parse unicode character classes with regex, your needs would require significant complexity.
for implementations that allow UTF-8 and implement proper unicode character classes, [A-Z], [a-z], and [0-9] would be synonymous to the unicode character classes "Uppercase Letter", "Lowercase Letter", "Decimal Number", "Letter Number", and "Other Number"
Your proposal, if I understand it correctly, makes parsing implementation-dependent. I don't think that's good. There should be one answer to the question, "is this text a symbol?"
then I would like to propose the following updated language in the specification:

Symbols

Surrounding a word with `:` signs creates a "symbol," which by default is just rendered literally but may be treated specially by a filter. (For example, a filter could convert symbols to emojis or entities or icons. But this is not built into djot.)

My reaction is :+1: :smiley:.

Notes:

- To be precise, the allowed characters in a symbol's word are `[A-Z]`, `[a-z]`, `[0-9]`, `.`, `-`, `_`, `&`, `+`, `*`, and `#`.
- The allowed word characters are considered ASCII (ie. 8-bit clean).
- For implementations that use regular expressions and/or allow UTF-8 and implement proper unicode character classes, `[A-Z]`, `[a-z]`, and `[0-9]` would be synonymous only with the POSIX character classes `[[:upper:]]`, `[[:lower:]]`, `[[:alpha:]]`, `[[:alnum:]]`, and `[[:digit:]]`, but not with the extended unicode equivalents `\p{XPosixAlpha}`, `\p{XPosixAlnum}`, `\p{XPosixDigit}`, `\p{XPosixLower}`, `\p{XPosixUpper}`, `\p{L}`, `\p{N}`.
Can you explain why we need the second and third clauses? Isn't the first clause unambiguous and sufficient to specify the syntax? Are there some regex implementations that include non-ASCII characters in `[a-z]`?
a good example is your usage of `%w` in lua. Perl has a similar one, `\w`, that can be made unicode-aware by setting options.
Technically all Lua character classes are locale-dependent, so that if I set for example a Swedish 8-bit locale, whatever bytes it uses to encode ‹ÅÄÖåäö› are included in `%w`, but in practice people only ever use the C locale, so `%w` is equivalent to ASCII `[a-zA-Z0-9]`. The real gotcha is that Lua `%w` doesn't include the underscore, so you must use `[_%w]` to match what `\w` matches in most regex flavors.

As for Perl, in recent versions `\w` etc. are Unicode-aware by default. You have to use `[_a-zA-Z0-9]` or the `/\w/a` (for ASCII) modifier to match the "classical" set.
So `%w` is locale-dependent, but `[a-z]` is not. So if we use the latter to specify this, we don't need your second and third bullet points, right?
then I would like to propose the following updated language in the specification:

Symbols

Surrounding a word with `:` signs creates a "symbol," which by default is just rendered literally but may be treated specially by a filter. (For example, a filter could convert symbols to emojis or entities or icons. But this is not built into djot.)

My reaction is :+1: :smiley: :Skull&Bones:.

Note: To be precise, the allowed characters in a symbol's word are `[A-Z]`, `[a-z]`, `[0-9]`, `.`, `-`, `_`, `&`, `+`, `*`, and `#`.
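The note's character set, written as a single regex check (a sketch, not normative):

```javascript
// Allowed characters per the proposed note: A-Z a-z 0-9 . - _ & + * #
const symbolWordRe = /^[A-Za-z0-9._&+*#-]+$/;

console.log(symbolWordRe.test("+1"));          // true
console.log(symbolWordRe.test("Skull&Bones")); // true
console.log(symbolWordRe.test("fa bars"));     // false (no spaces)
```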
👍
will you close this after you have updated the spec and code?