trentm/python-markdown2

Enclosing `fenced-code-blocks` in a `<div>` tag renders incorrect HTML

bow opened this issue · 5 comments

bow commented

Hi maintainers

I have recently found a combination of raw HTML tag + Markdown (with fenced-code-blocks enabled) that I believe renders incorrectly.

Here is a small test case that demonstrates the behavior.

Given the following markdown.md file:

<div class="enclosing">
```python
x = 1
```
</div>

and the following command

python lib/markdown2.py --extras fenced-code-blocks markdown.md

The following HTML is rendered:

<div class="enclosing">
<div class="codehilite">
<pre><span></span><code><span class="n">x</span> <span class="o">=</span> <span class="mi">1</span>
</code></pre>
</div>

<p></div></p>

Here, the final closing </div> ends up inside a new <p> pair, leaving the outermost <div> with class "enclosing" unclosed.

The behavior I expected is to render this HTML:

<div class="enclosing">
<div class="codehilite">
<pre><span></span><code><span class="n">x</span> <span class="o">=</span> <span class="mi">1</span>
</code></pre>
</div>
</div>
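
For anyone who wants to check the two outputs above mechanically, here is a minimal sketch using only the standard library. The `TagBalance` class and `check` helper are hypothetical names made up for this illustration; they are not part of markdown2. It only tracks a stack of open tags, so it is a rough well-formedness check rather than a real HTML validator.

```python
from html.parser import HTMLParser


class TagBalance(HTMLParser):
    """Keep a stack of open tags and note closers that do not match the top."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.stray_closers = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            # Closing tag does not match the innermost open element.
            self.stray_closers.append((tag, list(self.open_tags)))


def check(html):
    """Return (tags still open at the end, mismatched closing tags)."""
    parser = TagBalance()
    parser.feed(html)
    parser.close()
    return parser.open_tags, parser.stray_closers


# The highlighted code block shared by both outputs in this report.
CODE_BLOCK = (
    '<div class="codehilite">\n'
    '<pre><span></span><code><span class="n">x</span> '
    '<span class="o">=</span> <span class="mi">1</span>\n'
    '</code></pre>\n'
    '</div>\n'
)

actual = '<div class="enclosing">\n' + CODE_BLOCK + '\n<p></div></p>\n'
expected = '<div class="enclosing">\n' + CODE_BLOCK + '</div>\n'

print(check(actual))    # (['div'], [('div', ['div', 'p'])])
print(check(expected))  # ([], [])
```

The first result shows the outer div still open and a </div> arriving while a <p> is the innermost element, which matches the description above.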

In fact, this was the case up to version 2.4.3. I have not done a fully exhaustive search, but #462 seems to have introduced this behavior.

As additional info, I tested this with CPython 3.10.4, on a8bc182 using Pygments 2.13.0.


If no one has looked into this yet, I'd be happy to take a deeper dive and submit a PR 🙂.

Good catch, thanks! Yeah, if you'd like to dive deeper, please do.

I am also rendering my documents with version 2.4.3 with Python 3.9 on Debian and Windows.
Versions 2.4.4 and later fail on the page breaks in my Markdown source.

sub header

It seems that the sub header following the page break is not recognized.

I took a look into this. The problem lies in the _hash_html_blocks function and the _strict_block_tag_re regex.
Essentially, it attempts to match against HTML block tags (like a div) and then hash them. However, the fenced code block gets put into a nested div, on the same level of indentation, like so:

<div class="enclosing">
<div class="codehilite">
<pre><span></span><code><span class="n">x</span> <span class="o">=</span> <span class="mi">1</span>
</code></pre>
</div>

</div>

And so the regex tries finding <div> blocks by matching against an opening tag and a closing tag. Of course, it matches the closing tag for the nested div and not the second closing tag. This creates something like this, which results in the </div> tag being put into a paragraph:

md5-6c15c5207ae336b3b80cbb077f8b842e


</div>

I am currently brainstorming ideas on how to solve this but it's certainly a headache.
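
The mismatch described above can be reproduced in isolation with a lazy, line-based pattern. This is only a sketch: `BLOCK_RE` is a simplified stand-in written for this illustration and is not the actual `_strict_block_tag_re` from markdown2, and `hash_block` is a hypothetical helper that imitates the md5- placeholder without claiming to match how the library computes its keys. The highlighted spans are also trimmed down to plain text.

```python
import hashlib
import re

# Simplified stand-in (an assumption, not markdown2's real regex): an opening
# <div at the start of a line, as few whole lines as possible, then </div>
# at the start of a line.
BLOCK_RE = re.compile(r"^<(div)\b(?:.*\n)*?</\1>[ \t]*(?=\n+|\Z)", re.M)

text = (
    '<div class="enclosing">\n'
    '<div class="codehilite">\n'
    '<pre><span></span><code>x = 1\n'
    '</code></pre>\n'
    '</div>\n'
    '\n'
    '</div>\n'
)


def hash_block(match):
    """Replace a matched block with an md5-style placeholder (illustrative)."""
    return "md5-" + hashlib.md5(match.group(0).encode("utf-8")).hexdigest()


match = BLOCK_RE.search(text)
# The lazy match stops at the first </div> that starts a line, so the matched
# block contains two opening tags but only one closing tag.
print(match.group(0).count("<div"))    # 2
print(match.group(0).count("</div>"))  # 1

# After hashing, the second </div> is left behind as ordinary text.
remainder = BLOCK_RE.sub(hash_block, text)
print(remainder)
```

Because the pattern has no notion of nesting depth, making it greedy would not help in general either: it would then run to the last </div> in the document and swallow unrelated blocks.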

@berndbenner could you attach a markdown code snippet for your issue?

bow commented

@Crozzers If it helps, in my particular example above, indentation level indeed affects the output.

Indenting the inner fenced code block:

<div class="enclosing">
  ```python
  x = 1
  ```
</div>

resulted in the closing </div> being matched correctly. Looking deeper into #462, undoing the newlines added there (or rather, removing a specific combination of them) also rendered the expected HTML.

To be honest, I am a little unsure if I could add a meaningful solution. HTML is not a regular language, and trying to parse these edge cases by piling on more regex seems like a Sisyphean task. Then again, the codebase is also new to me and there are definitely parts that I do not completely understand yet. So 🤞 ~

I've managed to get a solution mostly working.
My solution is to simply iterate over each line in the text and manually tally up the number of opening/closing tags and then hash the relevant block. It seems to work well but one test is not passing.
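
As a rough illustration of the tallying idea described above (not the code from the eventual PR), the sketch below walks the text line by line and keeps a depth counter for div tags. The function name `find_div_blocks` is hypothetical, only `<div>` is tracked, and the hashing step is left out; it just reports which line ranges would be treated as one block.

```python
import re

OPEN_RE = re.compile(r"<div\b")
CLOSE_RE = re.compile(r"</div>")


def find_div_blocks(text):
    """Return (first_line, last_line) index pairs for top-level <div> blocks."""
    blocks = []
    depth = 0
    start = None
    for lineno, line in enumerate(text.splitlines()):
        opens = len(OPEN_RE.findall(line))
        closes = len(CLOSE_RE.findall(line))
        # A block begins only at a line that starts with <div while no
        # other block is currently open.
        if depth == 0 and opens and line.startswith("<div"):
            start = lineno
        if start is not None:
            depth += opens - closes
            if depth <= 0:
                # Every opening tag seen since `start` has been closed.
                blocks.append((start, lineno))
                depth = 0
                start = None
    return blocks


sample = (
    'Intro paragraph\n'
    '\n'
    '<div class="enclosing">\n'
    '<div class="codehilite">\n'
    '<pre><span></span><code>x = 1\n'
    '</code></pre>\n'
    '</div>\n'
    '\n'
    '</div>\n'
)

print(find_div_blocks(sample))  # [(2, 8)]
```

Here the nested codehilite div raises the depth to 2, so the first closing tag only brings it back to 1 and the block ends at the outer closing tag instead.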
The sublist_para test case looks like this:

<p>Some quick thoughts from a coder's perspective:</p>

<ul>
<li><p>The source will be available in a Mercurial ...</p></li>
<li><p>Komodo is a Mozilla-based application...</p>

<ul>
<li>Get a slightly tweaked mozilla build (C++, JavaScript, XUL).</li>
<li>Get a slightly tweaks Python build (C).</li>
<li>Add a bunch of core logic (Python)...</li>
<li>Add Komodo chrome (XUL, JavaScript, CSS, DTDs).</li>
</ul>

<p><p>What this means is that work on and add significant functionality...</p></li>
<li><p>Komodo uses the same extension mechanisms as Firefox...</p></li>
<li><p>Komodo builds and runs on Windows, Linux and ...</p></li>
</ul></p>

But this seems wrong? The final list items should not, in my opinion, be wrapped in an additional <p> tag. When rendering in Firefox, it auto-corrects to this:

<p></p>
<p>What this means is that work on and add significant functionality...</p></li>
<li><p>Komodo uses the same extension mechanisms as Firefox...</p></li>
<li><p>Komodo builds and runs on Windows, Linux and ...</p></li>
</ul>
<p></p>

So Firefox also does not think the final block should be wrapped in a <p> tag.

I'll clean up my code a bit and submit a PR with this test case "fixed" and we'll see what happens.