baynezy/Html2Markdown

Problem with emphasis and whitespace

Closed this issue · 7 comments

First off, thanks for the great project. Big time saver for what I'm doing.

According to CommonMark, an emphasis delimiter like ** cannot open emphasis if it is immediately followed by whitespace. However, if I feed something like this into Html2Markdown:

"This is a<strong> test</strong> that causes problems."

This is what gets returned:

"This is a** test** that causes problems."

...And then if that is converted back into HTML for display by a CommonMark-compliant converter, the emphasis is ignored and I just get literal asterisks. This is of course happening because there's a space immediately after the opening strong tag.

Is there a workaround for this other than me manually writing a regex or doing some sort of recursive replace?

Thanks!

In case this helps anyone, I ended up doing some interesting things to make this work:

            var searchTags = new List<string>()
            {
                "strong",
                "em"
            };

            foreach (var tag in searchTags)
                html = ShuffleWhitespace(html, tag);

            var markdown = converter.Convert(html);

And the code that moves the whitespace around is here:

        /// <summary>
        /// See this issue: https://github.com/baynezy/Html2Markdown/issues/99
        /// So, we shuffle the whitespace right after the opening tag to before the tag.
        /// And we do the converse for the closing tag.
        /// </summary>
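        // Requires: using System.Text.RegularExpressions;
        private string ShuffleWhitespace(string html, string tag)
        {
            // Only matches bare tags such as <strong>; tags with attributes are left alone.
            var openTag = $"<{tag}>";
            var closeTag = $"</{tag}>";

            // "<strong>   text" becomes " <strong>text" (the whitespace run collapses to one space).
            html = Regex.Replace(html, $@"{openTag}\s+", " " + openTag);
            // "text   </strong>" becomes "text</strong> ".
            html = Regex.Replace(html, $@"\s+{closeTag}", closeTag + " ");

            return html;
        }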

At first, ShuffleWhitespace used recursion to ensure all whitespace was handled, but then I switched to a regex. Not as fun but more efficient in some cases.
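
For anyone who wants to try the workaround in isolation, here is a minimal standalone sketch. It reuses the ShuffleWhitespace logic from above unchanged, made static so it runs without the surrounding class; the ShuffleDemo class name and the Main method are only illustrative and are not part of Html2Markdown.

```csharp
using System;
using System.Text.RegularExpressions;

// Illustrative wrapper class; the name is an assumption, not library code.
public static class ShuffleDemo
{
    // Same logic as the ShuffleWhitespace method posted above.
    public static string ShuffleWhitespace(string html, string tag)
    {
        var openTag = $"<{tag}>";
        var closeTag = $"</{tag}>";

        // Move whitespace that follows the opening tag to just before it.
        html = Regex.Replace(html, $@"{openTag}\s+", " " + openTag);
        // Move whitespace that precedes the closing tag to just after it.
        html = Regex.Replace(html, $@"\s+{closeTag}", closeTag + " ");

        return html;
    }

    public static void Main()
    {
        var html = "This is a<strong> test</strong> that causes problems.";

        foreach (var tag in new[] { "strong", "em" })
            html = ShuffleWhitespace(html, tag);

        Console.WriteLine(html);
        // prints: This is a <strong>test</strong> that causes problems.
    }
}
```

The cleaned-up HTML can then be passed to the converter as in the first snippet.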

...Please tell me there's a built-in solution to this so I can delete this code. :)

@kinetiq - thanks for the bug. I will think about how best to resolve that. I am glad you have a workaround in the meantime.

@kinetiq - can you please review PR #101 - I think I have resolved your issue, but I would appreciate your validation.

At a glance, this looks like the right idea to me! Thanks for getting on this. If you do a NuGet release, I will confirm on my end.

@kinetiq 3.3.0.402 is now published and ready.

@baynezy Okay, I tested this and we've solved half the problem. I found more. I actually thought I saw handling for this in your pull request, but I guess not.

Consider this case, noting the whitespace right before the closing strong tag:

"This is a<strong> test </strong> that causes problems."

This now renders:

"This is a **test ** that causes problems."

The first ** is perfect now. But on the closing **, we have the same problem, in reverse. The space before the final ** prevents it from closing the emphasis, just like before.
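
To make the two-sided case concrete, here is a rough sketch of a pre-processing step that moves both the leading and the trailing whitespace outside the tag in a single pass. This is not how Html2Markdown handles it internally; the EmphasisWhitespace class and MoveOutside method are hypothetical names, and the regex assumes bare, non-nested strong/em tags without attributes.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Html2Markdown.
public static class EmphasisWhitespace
{
    // Group 1: tag name, 2: leading whitespace, 3: content, 4: trailing whitespace.
    // Assumes bare <strong>/<em> tags (no attributes, no nesting of the same tag).
    private static readonly Regex Padded = new Regex(
        @"<(strong|em)>(\s*)(.*?)(\s*)</\1>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Re-emits each element with its inner padding moved outside the tags.
    public static string MoveOutside(string html) =>
        Padded.Replace(html, "$2<$1>$3</$1>$4");

    public static void Main()
    {
        var html = "This is a<strong> test </strong> that causes problems.";
        Console.WriteLine(MoveOutside(html));
        // prints: This is a <strong>test</strong>  that causes problems.
        // (two spaces after </strong>: the moved one plus the original one)
    }
}
```

Unlike the two-step replace earlier in the thread, this keeps the whitespace exactly as it was rather than collapsing it to a single space, so a doubled space can appear after the closing tag.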

So close! If you fix it I will do another test. Thanks again for your work on this very helpful project.

@kinetiq - this should now be resolved in 3.3.1.407 which is published to NuGet.