expose Sourcepos info to SyntaxHighlighterAdapters
Gankra opened this issue · 9 comments
I'm working on a project that wants to provide a warning when the user uses an unknown language on a code fence, and we use miette's fancy machinery for printing out spans in the original source. It would be great if we could actually print out the line/col where we found the problem. It seems like comrak has that info in its AST, but the SyntaxHighlighterAdapter API doesn't expose it. comrak also buffers up intermediate results, so we can't pull any "substring of the original input, do pointer comparisons" shenanigans.
There are a few different ways to introduce this kind of functionality (break the old trait methods, add new trait methods that by default call the old ones, ...). Is there one you'd prefer?
👋🐰 Hiya, lovely to see you here!
Break the old trait methods! Sourcepos should be supplied everywhere, it's just such a recent introduction (#232) that some of the extensibility bits added before it don't have it. Seeing as we're still pre-1.0, I'm happier making breaking changes for a more correct API now.
I think more-or-less adding the sourcepos: Option<Sourcepos> parameter (e.g.) should be sufficient.
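A minimal sketch of what that breaking change might look like. Everything here is a local stand-in defined for illustration: `LineColumn`, `Sourcepos`, the `Highlighter` trait, its `highlight` method, and `WarnOnUnknown` are hypothetical and are not comrak's actual definitions or signatures.

```rust
/// Stand-in for a 1-based line/column position (hypothetical, not comrak's type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LineColumn {
    pub line: usize,
    pub column: usize,
}

/// Stand-in for an inclusive start/end span (hypothetical, not comrak's type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Sourcepos {
    pub start: LineColumn,
    pub end: LineColumn,
}

/// Hypothetical adapter trait whose method takes the extra parameter.
pub trait Highlighter {
    fn highlight(&self, lang: Option<&str>, code: &str, sourcepos: Option<Sourcepos>) -> String;
}

/// Example adapter that flags languages it does not recognise.
pub struct WarnOnUnknown {
    pub known: Vec<&'static str>,
}

impl Highlighter for WarnOnUnknown {
    fn highlight(&self, lang: Option<&str>, code: &str, sourcepos: Option<Sourcepos>) -> String {
        match lang {
            // Unknown language: prepend a warning that names the position, if we got one.
            Some(l) if !self.known.iter().any(|k| *k == l) => {
                let at = match sourcepos {
                    Some(sp) => format!("{}:{}", sp.start.line, sp.start.column),
                    None => "unknown position".to_string(),
                };
                format!("<!-- warning: unknown language `{}` at {} -->{}", l, at, code)
            }
            // Known language or no language: pass the code through unchanged.
            _ => code.to_string(),
        }
    }
}
```

With a span in hand, the adapter can report where the offending block starts instead of only what went wrong.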
In this case we have two things we'd want the Sourcepos of: the lang identifier and the block. Wasn't sure if you had any examples like that.
I mean I guess they're always(???) separated by a single newline so maybe you only need one..?
(Oh wait even if that's true there's gunk with \r\n that makes even a newline an unpredictable length...)
Oh boy I am reading the code to see how complicated it is to properly find those sub-sourceposes of the NodeCodeBlock and I am quickly learning that this is a very complex parser and of course code blocks have a bunch of extra wobbly crap to adjust for (markdown parsing complex?? what a shock!!! 😸)
Ah, I see -- yep, we only record sourcepos "proper" on a per-node basis in the AST. A code block is one big node! Currently there are some things you can infer, but some things you can't. e.g., given this input:
Hi!

 ``` abc def
code
 ```
^- Note there's a leading space here.
The code block node has sourcepos 3:2-5:4, which is kind of helpful. The NodeCodeBlock has a fence_length of 3, fence_offset of 1, and info of "abc def". Note that you can't quite infer e.g. the exact sourcepos of the info string, because the leading space has been trimmed. Everything else can be, though! I'd be happy with adding either the length of the trimmed prefix (to make the rest inferable), or, actually just add an info_sourcepos to the NodeCodeBlock and record it explicitly.
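As a rough illustration of the "length of the trimmed prefix" option, here is a sketch of the arithmetic. The `trimmed_prefix_len` parameter is the hypothetical new piece of data, and `LineColumn`, `Sourcepos`, and `info_sourcepos` are local stand-ins, not comrak's API. It assumes columns are 1-based byte counts and that the stored info string is byte-for-byte what appeared in the source (no unescaping), which may not hold in general.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LineColumn {
    pub line: usize,
    pub column: usize,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Sourcepos {
    pub start: LineColumn,
    pub end: LineColumn,
}

/// Infers the span of the info string on the opening fence line.
///
/// `block_start` is the start of the code block node (the first fence character),
/// `fence_length` is the number of fence characters, and `trimmed_prefix_len`
/// (hypothetical) is how much whitespace was trimmed before the info string.
pub fn info_sourcepos(
    block_start: LineColumn,
    fence_length: usize,
    trimmed_prefix_len: usize,
    info: &str,
) -> Option<Sourcepos> {
    if info.is_empty() {
        return None; // nothing to point at
    }
    // Skip the fence characters and the trimmed whitespace to reach the info string.
    let start_col = block_start.column + fence_length + trimmed_prefix_len;
    Some(Sourcepos {
        start: LineColumn { line: block_start.line, column: start_col },
        // End column is inclusive, hence the `- 1`.
        end: LineColumn { line: block_start.line, column: start_col + info.len() - 1 },
    })
}
```

For the example above (block start 3:2, fence_length 3, one trimmed space, info "abc def"), this yields 3:6-3:12.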
Another gotcha worth noting: if you add another leading space before the closing ``` there, it's still a valid fenced code block close, but that leading space is (by spec) ignored. The sourcepos for the code block becomes 3:2-5:5, since 5:5 is indeed its last character. You can have three leading spaces (total); four and nope. (This is modulo other indentation rules currently in effect, e.g. lists. Cursed.) But this isn't enough to reconstruct, because trailing spaces after the ending fence are A-OK and count towards the sourcepos too. Without adding something else here -- and again, maybe it's just an explicit sourcepos for the closing fence -- there isn't fidelity to tell exactly where in the document those backticks (or tildes!) might be.
(also, it's worth noting that you'll never see the difference between \r\n and \n when processing the AST -- line endings as defined by spec include both; when storing literal data (such as the literal field of NodeCodeBlock), we append \n for line endings.)
edit: grammar, syntax
> I'd be happy with adding either the length of the trimmed prefix (to make the rest inferable), or, actually just add an info_sourcepos to the NodeCodeBlock and record it explicitly.
Yeah I think you're probably far more qualified to make this adjustment than me. Just to be clear on what I think a user would plausibly want to know, in decreasing order:
- the span of the body (literal?) of the code block (so a syntax highlighter can output errors as offsets within the input it processes, and that can just be directly applied to the sourcepos of the body to get an absolute span in the document)
- the span of the info components (so you can point at it and be like "idk what this language is")
- the span of the entire code block (what the AST stores? Although I'm not sure when I'd want to point at the whole thing)
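To make the first use case concrete, here is a sketch of turning an offset inside the body into an absolute line/column, assuming the body's starting position were available. `LineColumn` and `absolute_position` are hypothetical names, not comrak's API. It assumes the body uses \n line endings, columns are 1-based byte counts, and every body line starts at the same source column (which would not hold if the parser strips differing amounts of indentation per line).

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LineColumn {
    pub line: usize,
    pub column: usize,
}

/// Maps a byte offset within a code block body to a document position.
///
/// `body_start` is where the first byte of the body sits in the document.
/// Returns `None` if `offset` is out of range or not on a char boundary.
pub fn absolute_position(body_start: LineColumn, body: &str, offset: usize) -> Option<LineColumn> {
    if offset > body.len() || !body.is_char_boundary(offset) {
        return None;
    }
    let before = &body[..offset];
    // Each newline before the offset moves us down one line.
    let lines_down = before.bytes().filter(|&b| b == b'\n').count();
    // Bytes between the last newline (if any) and the offset.
    let col_in_line = match before.rfind('\n') {
        Some(i) => offset - i - 1,
        None => offset,
    };
    Some(LineColumn {
        line: body_start.line + lines_down,
        column: body_start.column + col_in_line,
    })
}
```

A highlighter that reports an error at a byte offset in the code it was handed could pass that offset through a function like this to get a position a diagnostic printer can use.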
Right, understood!
(To be clear, I don't mean to say I actually have the time/energy to do the adding of it myself! Just what I'd be fine with adding to the data model. There's been a lot of maintenance work to do on Comrak lately, and it's already unsustainable given lack of meaningful sponsorship etc.)
Totally understood. I'll look over your comment a bit closer because it's definitely filling in some of the context I was missing (and I might do some driveby docs/comments to help others contribute more too).