Problem with emphasis and whitespace
First off, thanks for the great project. Big time saver for what I'm doing.
According to CommonMark, an emphasis delimiter like ** cannot open emphasis if it is immediately followed by whitespace. However, if I feed something like this into Html2Markdown:
"This is a<strong> test</strong> that causes problems."
This is what gets returned:
"This is a** test** that causes problems."
...And then if that is converted back into HTML for display by a CommonMark-compliant converter, the emphasis is ignored and I just get literal asterisks. This is of course happening because there's a space right after the opening strong tag.
Is there a workaround for this other than me manually writing a regex or doing some sort of recursive replace?
Thanks!
In case this helps anyone, I ended up doing some interesting things to make this work:
var searchTags = new List<string>()
{
    "strong",
    "em"
};

foreach (var tag in searchTags)
    html = ShuffleWhitespace(html, tag);

var markdown = converter.Convert(html);
And the code that moves the whitespace around is here:
/// <summary>
/// See this issue: https://github.com/baynezy/Html2Markdown/issues/99
/// So, we shuffle the whitespace right after the opening tag to before the tag.
/// And we do the converse for the closing tag.
/// </summary>
private string ShuffleWhitespace(string html, string tag)
{
    var openTag = $"<{tag}>";
    var closeTag = $"</{tag}>";

    html = Regex.Replace(html, $@"{openTag}\s+", " " + openTag);
    html = Regex.Replace(html, $@"\s+{closeTag}", closeTag + " ");

    return html;
}
At first, ShuffleWhitespace used recursion to ensure all whitespace was handled, but then I switched to a regex. Not as fun but more efficient in some cases.
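For anyone copying this, here is a slightly hardened sketch of the same idea. It assumes the tags may carry attributes or differ in case, which the version above does not handle. The class name EmphasisWhitespace and the method Shuffle are hypothetical names for illustration, not part of Html2Markdown.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper (not part of Html2Markdown): moves whitespace that sits
// just inside an emphasis tag to just outside it, before conversion.
public static class EmphasisWhitespace
{
    public static string Shuffle(string html, params string[] tags)
    {
        foreach (var tag in tags)
        {
            // Escape the tag name so it is always treated as a literal.
            var name = Regex.Escape(tag);

            // Opening tag (attributes allowed) followed by whitespace:
            // replace the whitespace with a single space before the tag.
            html = Regex.Replace(html, $@"(<{name}\b[^>]*>)\s+", " $1",
                RegexOptions.IgnoreCase);

            // Whitespace followed by the closing tag:
            // replace the whitespace with a single space after the tag.
            html = Regex.Replace(html, $@"\s+(</{name}\s*>)", "$1 ",
                RegexOptions.IgnoreCase);
        }

        return html;
    }
}
```

Calling EmphasisWhitespace.Shuffle(html, "strong", "em") on the example from the top of this issue should give "This is a <strong>test</strong> that causes problems.", which can then be passed to converter.Convert as before. Like the original, it collapses any run of whitespace into a single space.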
...Please tell me there's a built-in solution to this so I can delete this code. :)
@kinetiq - thanks for the bug. I will think about how best to resolve that. I am glad you have a workaround in the meantime.
At a glance, this looks like the right idea to me! Thanks for getting on this. If you do a NuGet release, I will confirm on my end.
@baynezy Okay, I tested this and we've solved half the problem. I found more. I actually thought I saw handling for this in your pull request, but I guess not.
Consider this case, noting the whitespace right before the closing strong tag:
"This is a<strong> test </strong> that causes problems."
This now renders:
"This is a **test ** that causes problems."
The first ** is perfect now. But on the closing **, we have the same problem, in reverse. The space before the final ** stops it from closing the emphasis, so the asterisks show up literally just like before.
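To illustrate what handling both sides could look like, here is a rough sketch that pre-processes the HTML with a single pattern, moving leading and trailing whitespace out of the element in one pass. This is an assumption about one possible approach, not how Html2Markdown implements its fix. EmphasisTrimmer and MoveWhitespaceOutside are hypothetical names, and nested tags of the same name are not handled.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper (not part of Html2Markdown).
public static class EmphasisTrimmer
{
    // Groups: 1 = tag name, 2 = attributes, 3 = leading whitespace,
    //         4 = inner content, 5 = trailing whitespace.
    private static readonly Regex Pattern = new Regex(
        @"<(strong|em)\b([^>]*)>(\s*)(.*?)(\s*)</\1\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static string MoveWhitespaceOutside(string html)
    {
        // Re-emit the element with its inner whitespace moved outside,
        // so the converter sees <strong>test</strong> with no inner padding.
        return Pattern.Replace(html, "$3<$1$2>$4</$1>$5");
    }
}
```

For the input "This is a<strong> test </strong> that causes problems." this should produce "This is a <strong>test</strong>  that causes problems.". Note the two spaces after the closing tag: unlike the earlier workaround, this version keeps the original whitespace rather than collapsing it, and the extra space is harmless once rendered.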
So close! If you fix it I will do another test. Thanks again for your work on this very helpful project.