Exposing parse utils
pi0 opened this issue · 11 comments
Hi. I quickly opened this tracker issue while writing unjs/automd#32 to see if you would be interested in also exposing a simple parse util (it could either stream or return the whole AST). This could be used as the parser core in unjs/omark ❤️
The md4c parser is pretty simple; it just receives 5 hooks:
- enter_block: returns block type and details (e.g., heading level)
- leave_block: block end
- enter_span: returns span type and details (e.g., link href)
- leave_span: span end
- text: inner text
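To make the hook list concrete, here is a minimal sketch of how a host could assemble a nested tree from those five events. The hook names follow the list above; `createTreeBuilder`, the node shape, and the way events are replayed by hand are assumptions for illustration, not md4w's or md4c's actual API.

```javascript
// Hypothetical host-side tree builder driven by five md4c-style hooks.
// The node shape ({type, props, children}) is an assumption for this sketch.
function createTreeBuilder() {
  const root = { type: "root", children: [] };
  const stack = [root]; // the last element is the node currently open

  // Blocks and spans nest the same way, so they share enter/leave logic.
  const enter = (type, props) => {
    const node = { type };
    if (props && Object.keys(props).length > 0) node.props = props;
    node.children = [];
    stack[stack.length - 1].children.push(node);
    stack.push(node);
  };
  const leave = () => {
    if (stack.length > 1) stack.pop();
  };

  return {
    enter_block: enter,
    leave_block: leave,
    enter_span: enter,
    leave_span: leave,
    // Text is appended as a plain string child of the open node.
    text: (value) => {
      stack[stack.length - 1].children.push(value);
    },
    result: () => root.children,
  };
}

// Replay, by hand, the events a parser might emit for "# Jobs".
const builder = createTreeBuilder();
builder.enter_block("h1");
builder.text("Jobs");
builder.leave_block();
const tree = builder.result();
```

Because every event crosses into JS here, this corresponds to the "call hooks each time" approach; building the tree natively would replace the builder with a single hand-off of the finished structure.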
Basically, we can create these hooks in the host. Calling JS functions from a wasm module is not ideal, but yes, I think I can do it. I just don't know whether these hooks are enough for omark's goal?
I am thinking of the fastest method to resolve the traversed MD tree so omark can make a simplified interface on top of it.
We might try to benchmark two methods:
- Calling a JS hook every time the parser fires one
- Constructing the tree (in native code) and calling the JS method once, after the tree is fully traversed
Please let me know if you would like me to try, or would like to compare yourself.
I prefer constructing the tree. How about an md to JSX-like tree?
# Jobs
Stay _foolish_, stay **hungry**!
[Apple](https://apple.com)
<a href="https://apple.com">Apple</a>
[
{type: 'h1', children: ['Jobs']},
{type: 'p', children: [
'Stay ',
{type: "em", children: ["foolish"]},
', stay ',
{type: "strong", children: ["hungry"]},
'!',
{type: 'a', props: {href: 'https://apple.com'}, children: ['Apple']},
{type: 'html', props: {html: '<a href="https://apple.com">Apple</a>'}, children: []}
]}
]
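As a rough illustration of why this shape is convenient, the sketch below walks such a tree and renders it back to HTML. `renderNode` and `renderTree` are hypothetical helper names, and the handling of the `html` node type (raw pass-through) is an assumption based on the example above.

```javascript
// Hypothetical renderer for the JSX-like tree shape proposed above.
const escapeHtml = (s) =>
  s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");

function renderNode(node) {
  // Plain strings are text nodes and must be escaped.
  if (typeof node === "string") return escapeHtml(node);
  // Assumption: 'html' nodes carry raw markup that is emitted untouched.
  if (node.type === "html") return node.props.html;
  const attrs = Object.entries(node.props ?? {})
    .map(([key, value]) => ` ${key}="${escapeHtml(String(value))}"`)
    .join("");
  const inner = (node.children ?? []).map(renderNode).join("");
  return `<${node.type}${attrs}>${inner}</${node.type}>`;
}

// Top-level blocks are joined with newlines.
const renderTree = (nodes) => nodes.map(renderNode).join("\n");

const html = renderTree([
  { type: "h1", children: ["Jobs"] },
  { type: "p", children: ["Stay ", { type: "em", children: ["foolish"] }, "!"] },
]);
```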
Honestly, for omark, I am considering a flattened array of streamable data (to make markdown ASTs as simple as possible), plus some alternative ways of nesting.
If you prefer a nested tree like other parsers, there is no problem; we can always convert.
What does the flattened array look like?
How about splitting by blocks? This should work as streamable data:
--- chunk 1
{type: 'h1', children: ['Jobs']}
--- chunk 2
{type: 'p', children: [
'Stay ',
{type: "em", children: ["foolish"]},
', stay ',
{type: "strong", children: ["hungry"]},
'!',
{type: 'a', props: {href: 'https://apple.com'}, children: ['Apple']},
{type: 'html', props: {html: '<a href="https://apple.com">Apple</a>'}, children: []}
]}
Or use arrays instead of objects:
--- chunk 1
['h1', ['Jobs']]
--- chunk 2
['p', [
'Stay ',
["em", ["foolish"]],
', stay ',
["strong", ["hungry"]],
'!',
['a', {href: 'https://apple.com'}, ['Apple']],
['html', {html: '<a href="https://apple.com">Apple</a>'}, []]
]]
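The two encodings carry the same information, so converting between them is mechanical. A small sketch, assuming the tuple layout is either `[type, children]` or `[type, props, children]` as in the chunks above; `tupleToNode` is a hypothetical name.

```javascript
// Convert the compact array form into the object form.
// Assumed layouts: [type, children] or [type, props, children].
function tupleToNode(item) {
  if (typeof item === "string") return item; // text stays a plain string
  const [type, second, third] = item;
  // A non-array object in second position is treated as props.
  const hasProps =
    typeof second === "object" && second !== null && !Array.isArray(second);
  const children = (hasProps ? third : second) ?? [];
  const node = { type };
  if (hasProps) node.props = second;
  node.children = children.map(tupleToNode);
  return node;
}

const heading = tupleToNode(["h1", ["Jobs"]]);
const link = tupleToNode(["a", { href: "https://apple.com" }, ["Apple"]]);
```

The array form saves bytes per node, which may matter when chunks are streamed; the object form is easier to read and to extend with new fields.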
Yes, exactly, I am thinking about splitting by logical blocks. But it is tricky to represent (still thinking about how). Mainly, I am considering using a Proxy that can access each block either as a stringified value or as something to be traversed individually. (Why? Because many tool use cases simply require the high-level representation of the markdown AST, not the details.) Something like this:
[
"Jobs", // .{ type: 'h1', contents: <Proxy>[p:stay foolish..a:apple] }
"Stay foolish, stay hungry!", // .{ type: 'p', contents: <Proxy>[.stay, em: ...] }
"Apple" // .{ type: 'a', contents: <Proxy>[apple] }
]
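One way this could work, sketched under assumptions: each block is wrapped in a `Proxy` that stringifies to its flattened text while `type` and a lazily wrapped `contents` remain reachable. `asBlock` and `textOf` are hypothetical names, and this is only one possible reading of the idea, not omark's design.

```javascript
// Flatten a node's text content; raw HTML nodes contribute nothing.
function textOf(node) {
  if (typeof node === "string") return node;
  if (node.type === "html") return "";
  return (node.children ?? []).map(textOf).join("");
}

// Hypothetical wrapper: acts like its text when stringified, while
// `type`, `props` and `contents` stay available for detailed traversal.
function asBlock(node) {
  return new Proxy(node, {
    get(target, prop, receiver) {
      // String(block) and `${block}` go through Symbol.toPrimitive.
      if (prop === Symbol.toPrimitive || prop === "toString") {
        return () => textOf(target);
      }
      // Children are wrapped only when `contents` is actually read.
      if (prop === "contents") {
        return (target.children ?? []).map((child) =>
          typeof child === "string" ? child : asBlock(child)
        );
      }
      return Reflect.get(target, prop, receiver);
    },
  });
}

const blocks = [
  { type: "h1", children: ["Jobs"] },
  {
    type: "p",
    children: [
      "Stay ",
      { type: "em", children: ["foolish"] },
      ", stay ",
      { type: "strong", children: ["hungry"] },
      "!",
    ],
  },
].map(asBlock);

// High-level view: just the text of each block.
const summary = blocks.map(String);
```

Tools that only need the outline can work with `summary`, while those that need details can read `blocks[1].type` or walk `blocks[1].contents`.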
I would love to brainstorm on this possibility together once we are there! I think for the first step we need the parsed AST, and I have high hopes of relying on md4w as promised before, since it is native and minimal! If you are good with the first proposal, #3 (comment), I think we can do it from there.
Sounds cool! I will try to implement a mdToJson function for a start.
I just made a quick wrapper in omark that produces (almost) the same object as you proposed, so we can work in parallel.
The object is meant for internal purposes only and I can happily adjust to what you finally provide, but I would also love to have your 👍🏼 on unjs/mdbox#15 if you have a few minutes to check, so we are safe to go.
thanks