Non isomorphic parsing/formatting for bold/italic with spaces

Question

SamyPesse opened this issue 3 years ago · 7 comments

SamyPesse commented 3 years ago

Initial checklist

I read the support docs
I read the contributing guide
I agree to follow the code of conduct
I searched issues and couldn’t find anything (or linked relevant results below)

Affected packages and versions

remark-parse@10.0.0

Link to runnable example

https://codesandbox.io/s/cocky-meitner-88li6

Steps to reproduce

To reproduce, parse the following markdown:

**Our **_**developer**_** guides** and APIs have a home of their own now.

Expected behavior

This markdown snippet works on GitHub:

Our developer guides and APIs have a home of their own now.

Actual behavior

The markdown snippet is serialized back as:

**Our **\_**developer**\_\*\* guides\*\* and APIs have a home of their own now.

Runtime

Node v14

Package manager

yarn v2

OS

Linux, macOS

Build and bundle tools

esbuild

Answer 1 · 2021-11-29T15:01:54.000Z

To provide a bit more context: in our application, users can select text with leading/trailing spaces and format it as bold/italic, basically something like:

hello<bold> world </bold>!

This led to issues when generating markdown with remark, because the following is not valid markdown:

hello** world **!

So we implemented custom logic to trim the inner content and move the spaces outside the bold/italic and other marks. But this can lead to a more complex tree, for which remark generated the following markdown:

**Our **_**developer**_** guides** and APIs have a home of their own now.

which it can't parse back afterwards.

I'm seeing 2 issues:

remark should probably trim the inner content of bold/italic/code to avoid generating invalid markup (e.g. it should generate **world** instead of ** world **).
remark cannot parse this markdown, even though it works on GitHub.
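For context on why `hello** world **!` is not treated as bold: CommonMark only lets a `*` run open emphasis if it is "left-flanking" and close emphasis if it is "right-flanking". The following is a minimal sketch of that check, not remark's actual implementation; the helper name `flanking` is hypothetical, string start/end is treated as whitespace, and the extra rules for `_` are left out.

```javascript
// Simplified CommonMark flanking check for a run of `*` characters.
// Assumption: start/end of the string counts as whitespace.
const isWhitespace = (ch) => ch === undefined || /\s/u.test(ch)
const isPunctuation = (ch) => ch !== undefined && /[\p{P}\p{S}]/u.test(ch)

// `start` is the index of the delimiter run, `length` its size (2 for `**`).
function flanking(text, start, length) {
  const prev = start > 0 ? text[start - 1] : undefined
  const next = text[start + length]

  // Left-flanking: not followed by whitespace, and if followed by
  // punctuation then preceded by whitespace or punctuation.
  const left =
    !isWhitespace(next) &&
    (!isPunctuation(next) || isWhitespace(prev) || isPunctuation(prev))

  // Right-flanking: the mirror image of the rule above.
  const right =
    !isWhitespace(prev) &&
    (!isPunctuation(prev) || isWhitespace(next) || isPunctuation(next))

  // For `*`, a run can open iff left-flanking and close iff right-flanking.
  return {canOpen: left, canClose: right}
}

const bad = 'hello** world **!'
console.log(flanking(bad, bad.indexOf('**'), 2)) // { canOpen: false, canClose: true }
console.log(flanking(bad, bad.lastIndexOf('**'), 2)) // { canOpen: true, canClose: false }
```

In the padded version the first `**` cannot open and the last `**` cannot close, so no strong node is produced. With the spaces moved outside (`hello **world**!`) both runs pass the check.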

Answer 2 · 2021-11-29T15:09:06.000Z

Likely related to syntax-tree/mdast-util-to-markdown#12

Answer 3 · 2021-11-29T15:57:48.000Z

remark should probably trim the inner content of bold/italic/code to avoid generating invalid markup (e.g. it should generate **world** instead of ** world **).

I dunno on the first point. Your code here is generating an object model that is impossible to make with markdown syntax. Take the DOM:

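// Build a tree that HTML syntax itself cannot express: an <h1> inside a <p>.
const p = document.createElement('p')
const h1 = document.createElement('h1')
h1.textContent = 'Hi!'
p.append(h1)

p.outerHTML // "<p><h1>Hi!</h1></p>"

// Serializing and reparsing does not round-trip: the parser closes the <p> early.
const d = document.createElement('div')
d.innerHTML = p.outerHTML
d.outerHTML // "<div><p></p><h1>Hi!</h1><p></p></div>"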

Especially with a vague language like markdown, I think there will always be cases that can easily be represented by JSON but are impossible to serialize/parse.

If you’re generating **Our **_**developer**_** guides**, why not generate **Our _developer_ guides** instead?

remark cannot parse this markdown that works on GitHub

Sure! Minimal repro: *a *__*b*__* c*

Answer 4 · 2021-11-29T17:10:43.000Z

Especially with a vague language like markdown, I think there will always be cases that can easily be represented by JSON but are impossible to serialize/parse.

Yes, I was wondering whether trimming spaces in bold/italic is something remark should handle or not. Maybe it's something we can implement as a plugin, similar to rehype-minify-whitespace.

Because I can imagine the confusion when the following tree generates invalid markdown:

{
    type: 'paragraph',
    children: [
        {
            type: 'strong',
            children: [
                {
                    type: 'text',
                    value: 'Hello ',
                },
            ],
        }
    ]
}
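As a rough illustration of the plugin idea, the sketch below moves leading and trailing whitespace out of `strong` and `emphasis` nodes so that the marks hug their content. It is plain JavaScript over mdast-shaped objects; `hoistWhitespace` is a hypothetical helper and not part of remark, it mutates the tree in place, and it does not merge the adjacent text nodes it creates.

```javascript
// Hypothetical helper (not a remark API): hoists edge whitespace out of
// `strong` / `emphasis` nodes in an mdast-like tree. Mutates the tree.
const MARKS = new Set(['strong', 'emphasis'])

function hoistWhitespace(node) {
  if (!Array.isArray(node.children)) return node

  const result = []
  for (const child of node.children) {
    hoistWhitespace(child) // handle nested marks first
    if (!MARKS.has(child.type)) {
      result.push(child)
      continue
    }

    let before = ''
    let after = ''

    const first = child.children[0]
    if (first && first.type === 'text') {
      const trimmed = first.value.trimStart()
      before = first.value.slice(0, first.value.length - trimmed.length)
      first.value = trimmed
    }

    const last = child.children[child.children.length - 1]
    if (last && last.type === 'text') {
      const trimmed = last.value.trimEnd()
      after = last.value.slice(trimmed.length)
      last.value = trimmed
    }

    // Drop text nodes that became empty after trimming.
    child.children = child.children.filter(
      (c) => c.type !== 'text' || c.value !== ''
    )

    if (before) result.push({type: 'text', value: before})
    if (child.children.length > 0) result.push(child) // skip now-empty marks
    if (after) result.push({type: 'text', value: after})
  }

  node.children = result
  return node
}

// The tree from the comment above: the trailing space ends up outside `strong`.
const tree = {
  type: 'paragraph',
  children: [{type: 'strong', children: [{type: 'text', value: 'Hello '}]}]
}
console.log(JSON.stringify(hoistWhitespace(tree)))
```

Because nested marks are processed before their parent, whitespace at the edge of an inner mark is hoisted through every enclosing mark as well.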

If you’re generating **Our **_**developer**_** guides**, why not generate **Our _developer_ guides** instead?

Yes, I'm looking at improving this on our side, in the step that converts our AST into the remark AST.

Answer 5 · 2021-11-29T17:25:08.000Z

What do you care most about? That it’s readable markdown? Or that it works?
Because readable markdown will always have such problems (also in Chinese and other languages).

There might be something to be done in CommonMark, e.g., <-** a **-> or so might be possible (although this looks horrible). A character to force them to open or close even when they currently can’t.

And a plugin as you mention might indeed be useful to a lot of folks.

Alternatively, inject HTML instead. <b>, <i> and such?
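One way the HTML fallback could look, as a sketch only: replace a `strong` or `emphasis` node whose text starts or ends with whitespace by inline `html` nodes wrapping its children. `marksToHtml` and `hasEdgeWhitespace` are hypothetical names, the tag mapping is an assumption, and the text between the tags is still ordinary markdown content.

```javascript
// Hypothetical helper (not a remark API): swaps marks that cannot be written
// with `*` / `_` because of edge whitespace for raw inline HTML tags.
const TAGS = new Map([
  ['strong', 'b'],
  ['emphasis', 'i']
])

function hasEdgeWhitespace(node) {
  const first = node.children[0]
  const last = node.children[node.children.length - 1]
  return Boolean(
    (first && first.type === 'text' && /^\s/.test(first.value)) ||
      (last && last.type === 'text' && /\s$/.test(last.value))
  )
}

function marksToHtml(node) {
  if (!Array.isArray(node.children)) return node

  node.children = node.children.flatMap((child) => {
    marksToHtml(child) // convert nested marks first
    const tag = TAGS.get(child.type)
    if (!tag || !hasEdgeWhitespace(child)) return [child]
    return [
      {type: 'html', value: `<${tag}>`},
      ...child.children,
      {type: 'html', value: `</${tag}>`}
    ]
  })

  return node
}

// hello<bold> world </bold>! from earlier in the thread
const paragraph = {
  type: 'paragraph',
  children: [
    {type: 'text', value: 'hello'},
    {type: 'strong', children: [{type: 'text', value: ' world '}]},
    {type: 'text', value: '!'}
  ]
}
console.log(JSON.stringify(marksToHtml(paragraph).children))
```

Marks without edge whitespace are left untouched, so the output stays readable markdown wherever that is possible.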

Answer 6 · 2022-02-04T13:34:37.000Z

I came up with a way to do it, I think: syntax-tree/unist#60 (comment).