ije/md4w

Exposing parse utils

pi0 opened this issue · 11 comments

pi0 commented

Hi. I quickly made this tracker issue while writing unjs/automd#32 to see if you would be interested in also exposing a simple parse util (it could either stream or return the whole AST). It could be used as the parser core in unjs/omark ❤️

ije commented

the md4c parser is pretty simple, it just receives 5 hooks:

  • enter_block: receives the block type and details (e.g., heading level)
  • leave_block: block end
  • enter_span: receives the span type and details (e.g., link href)
  • leave_span: span end
  • text: inner text

basically we can create these hooks in the host. calling js functions from a wasm module is not ideal, but yes, i think i can do it. i just don't know whether these hooks are enough for omark's goal.
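
As a rough illustration of how those five hooks could be consumed on the host side, here is a minimal TypeScript sketch that folds hook calls into a nested tree. The `Hooks` interface, the `TreeBuilder` class, and the node shape are all assumptions made for this example; they are not md4c's or md4w's actual API.

```typescript
// Hypothetical node shape, chosen only for this sketch.
type MDNode = {
  type: string;
  props?: Record<string, unknown>;
  children: (MDNode | string)[];
};

// Hypothetical host-side mirror of the five md4c hooks.
interface Hooks {
  enterBlock(type: string, detail?: Record<string, unknown>): void;
  leaveBlock(): void;
  enterSpan(type: string, detail?: Record<string, unknown>): void;
  leaveSpan(): void;
  text(value: string): void;
}

class TreeBuilder implements Hooks {
  readonly root: MDNode = { type: "root", children: [] };
  // Stack of currently open nodes; the last entry receives new children.
  private stack: MDNode[] = [this.root];

  private top(): MDNode {
    return this.stack[this.stack.length - 1]!;
  }

  private open(type: string, detail?: Record<string, unknown>): void {
    const node: MDNode = detail
      ? { type, props: detail, children: [] }
      : { type, children: [] };
    this.top().children.push(node);
    this.stack.push(node);
  }

  private close(): void {
    // Never pop the root, even if leave calls are unbalanced.
    if (this.stack.length > 1) this.stack.pop();
  }

  enterBlock(type: string, detail?: Record<string, unknown>): void {
    this.open(type, detail);
  }
  leaveBlock(): void {
    this.close();
  }
  enterSpan(type: string, detail?: Record<string, unknown>): void {
    this.open(type, detail);
  }
  leaveSpan(): void {
    this.close();
  }
  text(value: string): void {
    this.top().children.push(value);
  }
}

// Usage: replay the hook calls a parser might make for "# Jobs".
const builder = new TreeBuilder();
builder.enterBlock("h1");
builder.text("Jobs");
builder.leaveBlock();
console.log(JSON.stringify(builder.root.children));
```

Blocks and spans share one stack here because both simply open and close a nesting level; a real binding may need to tell them apart.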

pi0 commented

I am thinking of the fastest method to resolve the traversed MD tree so omark can make a simplified interface on top of it.

We might try to benchmark two methods:

  • Calling into JS every time a hook fires
  • Constructing the tree in native code and calling the JS method once, after the traversal is complete

Please let me know if you would like me to try, or if you would prefer to compare them yourself 👍🏼

ije commented

i prefer constructing the tree. how about md to a jsx-like tree?

# Jobs
Stay _foolish_, stay **hungry**!
[Apple](https://apple.com)
<a href="https://apple.com">Apple</a>
[
  {type: 'h1', children: ['Jobs']},
  {type: 'p', children: [
    'Stay ',
    {type: "em", children: ["foolish"]},
    ', stay ',
    {type: "strong", children: ["hungry"]},
    '!',
    {type: 'a', props: {href: 'https://apple.com'}, children: ['Apple']},
    {type: 'html', props: {html: '<a href="https://apple.com">Apple</a>'}, children: []}
  ]}
]
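
To make the proposed shape concrete, the following TypeScript sketch types the tree above and renders it to an HTML string. `MDNode`, `escapeHtml`, and `renderHtml` are hypothetical names used only for illustration. Treating the `html` node's `props.html` as raw markup is an assumption drawn from the example, not from md4w's documented behavior.

```typescript
// Hypothetical type for the JSX-like tree proposed above.
type MDNode = {
  type: string;
  props?: Record<string, string>;
  children: (MDNode | string)[];
};

// Escape text and attribute values for safe embedding in HTML.
function escapeHtml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

function renderHtml(node: MDNode | string): string {
  if (typeof node === "string") return escapeHtml(node);
  // Assumption: 'html' nodes carry raw markup that is emitted verbatim.
  if (node.type === "html") return node.props?.html ?? "";
  const props = node.props ?? {};
  const attrs = Object.keys(props)
    .map((key) => ` ${key}="${escapeHtml(props[key] ?? "")}"`)
    .join("");
  const inner = node.children.map((child) => renderHtml(child)).join("");
  return `<${node.type}${attrs}>${inner}</${node.type}>`;
}

// Usage with part of the tree from the comment above.
const tree: MDNode[] = [
  { type: "h1", children: ["Jobs"] },
  {
    type: "p",
    children: [
      "Stay ",
      { type: "em", children: ["foolish"] },
      ", stay ",
      { type: "strong", children: ["hungry"] },
      "!",
    ],
  },
];
console.log(tree.map((node) => renderHtml(node)).join("\n"));
```

Because `type` doubles as the tag name, the renderer needs no lookup table; that only holds while node types map one-to-one onto HTML tags.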

pi0 commented

Honestly, for omark, I am considering a flattened array of streamable data (to make markdown ASTs as simple as possible), plus some alternative ways of nesting.

If you prefer a nested tree like other parsers, there is no problem; we can always convert 👍🏼

ije commented

what does the flattened array look like?

ije commented

how about splitting by blocks? this should work as streamable data

--- chunk 1
{type: 'h1', children: ['Jobs']}
--- chunk 2
{type: 'p', children: [
  'Stay ',
  {type: "em", children: ["foolish"]},
  ', stay ',
  {type: "strong", children: ["hungry"]},
  '!',
  {type: 'a', props: {href: 'https://apple.com'}, children: ['Apple']},
  {type: 'html', props: {html: '<a href="https://apple.com">Apple</a>'}, children: []}
]}

or use array instead of object:

--- chunk 1
['h1', ['Jobs']]
--- chunk 2
['p', [
  'Stay ',
  ["em", ["foolish"]],
  ', stay ',
  ["strong", ["hungry"]],
  '!',
  ['a', {href: 'https://apple.com'}, ['Apple']],
  ['html', {html: '<a href="https://apple.com">Apple</a>'}, []]
]]
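
The two ideas above (one chunk per block, arrays instead of objects) can be sketched together in TypeScript. `toTuple` and `forEachChunk` are hypothetical helpers, and the tuple layout, which omits the props slot when a node has none, is inferred from the example rather than from any published md4w format.

```typescript
// Hypothetical object form, as in the earlier proposal.
type MDNode = {
  type: string;
  props?: Record<string, string>;
  children: (MDNode | string)[];
};

// Array form: [type, children] or [type, props, children].
type Tuple =
  | [string, (Tuple | string)[]]
  | [string, Record<string, string>, (Tuple | string)[]];

// Convert one object node (and its descendants) to the array form.
function toTuple(node: MDNode): Tuple {
  const children = node.children.map((child) =>
    typeof child === "string" ? child : toTuple(child)
  );
  return node.props
    ? [node.type, node.props, children]
    : [node.type, children];
}

// Emit one chunk per top-level block, so a consumer can start
// working before the whole document has been converted.
function forEachChunk(blocks: MDNode[], emit: (chunk: Tuple) => void): void {
  for (const block of blocks) emit(toTuple(block));
}

// Usage: print each block as its own JSON line.
forEachChunk(
  [
    { type: "h1", children: ["Jobs"] },
    { type: "p", children: ["Stay ", { type: "em", children: ["foolish"] }] },
  ],
  (chunk) => console.log(JSON.stringify(chunk))
);
```

One cost of the array form is that consumers must check the tuple length to know whether the second slot holds props or children.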

pi0 commented

Yes, exactly. I am thinking about splitting by logical blocks, but it is tricky to represent (I am still thinking about how). Mainly, I am considering a Proxy that lets each block be accessed either as a stringified value or traversed individually. (Why? Because many tools only need the high-level representation of the markdown AST, not the details.) Something like this:

[
  "Jobs", // .{ type: 'h1', contents: <Proxy>[p:stay foolish..a:apple] }
  "Stay foolish, stay hungry!", // .{ type: 'p', contents: <Proxy>[.stay, em: ...] }
  "Apple" // .{ type: 'a', contents: <Proxy>[apple] }
]

I would love to brainstorm on this possibility together once we are there! I think as a first step we need the parsed AST, and I have high hopes of relying on md4w, as promised before, since it is native and minimal! If you are good with the first proposal, #3 (comment), I think we can go from there.
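
The "string first, details on demand" idea can be approximated without a Proxy. The TypeScript sketch below swaps in a plain class with a `toString` method, which is a simplification and not the Proxy design described above. `BlockView` and `textOf` are hypothetical names, and how links would be split into their own entries is left open.

```typescript
// Hypothetical node shape, matching the tree proposed earlier in the thread.
type MDNode = {
  type: string;
  props?: Record<string, string>;
  children: (MDNode | string)[];
};

// Collect the plain text of a node; raw HTML nodes contribute nothing.
function textOf(node: MDNode | string): string {
  if (typeof node === "string") return node;
  if (node.type === "html") return "";
  return node.children.map((child) => textOf(child)).join("");
}

// A block that reads as a string but still exposes its structure.
class BlockView {
  readonly type: string;
  readonly contents: (MDNode | string)[];

  constructor(node: MDNode) {
    this.type = node.type;
    this.contents = node.children;
  }

  toString(): string {
    return this.contents.map((child) => textOf(child)).join("");
  }
}

// Usage: the high-level view is just the text; details stay reachable.
const view = new BlockView({
  type: "p",
  children: [
    "Stay ",
    { type: "em", children: ["foolish"] },
    ", stay ",
    { type: "strong", children: ["hungry"] },
    "!",
  ],
});
console.log(`${view}`);
console.log(view.type, view.contents.length);
```

A real Proxy could make each entry behave even more like a string primitive, at the cost of having to bind `String.prototype` methods to the wrapped value.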

ije commented

sounds cool! i will try to implement a mdToJson function for a start.

pi0 commented

I just made a quick wrapper in omark that produces (almost) the same object as your proposal, so we can work in parallel.

The object is meant for internal purposes only and I can happily adjust to whatever you finally provide, but I would also love to have your 👍🏼 on unjs/mdbox#15 if you have a few minutes to check, so we are safe to go.

ije commented

thanks

ije commented

@pi0 #4 the first test has passed (not finished, it can't handle nested blocks/spans yet)