Python-Markdown/markdown

BlockProcessor output wrapped in p tag

Apreche opened this issue · 9 comments

It seems that a common request is to have a stand-alone image rendered as a figure with a figcaption.

e.g.:

![cool image](https://mydomain.com/cool_image.jpg "a cool image")

should be rendered as

<figure>
<img src="https://mydomain.com/cool_image.jpg" alt="a cool image" />
<figcaption>a cool image</figcaption>
</figure>

I agree that an extension is the correct way to handle this case. I found a small handful of extensions that offer this functionality. However, all of them have a major flaw that makes them completely unacceptable. Their output is incorrectly wrapped in a p tag like this.

<p>
<figure>
<img src="https://mydomain.com/cool_image.jpg" alt="a cool image" />
<figcaption>a cool image</figcaption>
</figure>
</p>

This is not because the extension developers desired the p tag to be there. It seems the p tag can't be avoided because of the way python-markdown BlockProcessor is implemented. The only way I can think of to handle this situation is to make a Postprocessor that looks for any p tags that only contain a figure, and then to remove the wrapping. That is a very ugly solution that would be nice to avoid.

It's also a problem that images sometimes appear as inline elements, when included with other content, but sometimes exist as block elements, when they stand all by themselves. The inline case is already handled perfectly. How can we handle the block case without interfering with images in the inline case?

Some extensions have tried to solve this by making special non-standard markdown syntax. I also find that to be an unacceptable solution.

Is there a way to write an extension to handle this case correctly?

There is a statement here that a block processor gets wrapped with p tags, but that is not a true statement. Block processors do not automatically get wrapped in p tags.

I'm not sure what extensions you've tried and what their approach is, but as you can see below, I can create a figure using a block processor and not have it wrapped in a p tag.

In this example, we use an extension that creates arbitrary HTML wrappers, in this case figure. No p tag is wrapped around the block. The content within the block gets run through the Markdown block processors and is therefore wrapped in a p tag, but figure itself is not.

import markdown

MD = """
/// html | figure
![cool image](https://mydomain.com/cool_image.jpg)
///
"""

html = markdown.markdown(
    MD,
    extensions=['pymdownx.blocks.html'],
)

print(html)

<figure>
<p><img alt="cool image" src="https://mydomain.com/cool_image.jpg" /></p>
</figure>

I suspect they are not doing what you think they are doing. I see no reason why an extension could not create a figure without it being wrapped in a p tag when using a block processor. I cannot comment on why they may be having their content wrapped in p tags as I would have to see their implementation to explain why.

@facelessuser Thanks, I think I may not have been clear. That extension manages to avoid the p tag because there is extra non-standard markdown syntax to avoid the issue. If I'm going to be including some extra weird markdown, then I might as well just put the HTML in the markdown directly.

If the markdown is only

![cool image](https://mydomain.com/cool_image.jpg)

With no additional /// html | figure or any other such extra markup permitted, can an extension be written that renders the desired HTML without a <p> tag wrapping it?

> That extension manages to avoid the p tag because there is extra non-standard markdown syntax to avoid the issue.

No, that is not true. The block is simply captured before the paragraph processor captures it. Again, you haven't really specified how these "other extensions" approach things, but what I'm saying is that if the extension was done properly, you could get what you want.

If your block extension captures the lone image by itself, treats it as a block before the paragraph processor does, and creates the figure, you can embed the image within it and the figure will not be wrapped in a paragraph. This is completely doable, but the extensions you are using are likely not doing that. I'm likely oversimplifying some steps, but there is nothing innate that forces a block to be wrapped in paragraphs. We have many block extensions whose output is not wrapped in paragraphs.

In short, a block processor that does what you want must treat the lone image as a block before the paragraph processor gets to it.

@facelessuser That's great! I will write the extension to do this. How can I ensure that my block extension happens before the paragraph extension?

waylan commented

A few observations.

The default behavior of always wrapping images in <p> tags is a result of the Markdown rules. Markdown is a subset of HTML and therefore does not support all of HTML's features. One feature that is not supported is block-level img elements (note that images are listed under "Span Elements" only, not under "Block Elements" in the document hierarchy). I realize some users don't like this, but we didn't write the rules, we just implement a parser which follows them.

Just because the default behavior is a certain way does not mean that it can't be changed. In fact, any part of the parser can be changed if one makes use of the correct part of the extension API. However, the default behavior will always follow the rules.

I haven't checked, but I suspect the various existing extensions that you have tried all use a custom inline processor, which will only ever parse span-level content. That would explain why they always result in the images being wrapped in <p> elements. However, if you implemented a block processor instead, it would output its own block-level element. Note that the ParagraphProcessor is the fallback block processor: it only gets called if no other block processor has already claimed the block. So, simply write a block processor which correctly identifies and processes your block-level images before they ever get to the ParagraphProcessor, and your output will never get wrapped in a <p> tag. The "priority" assigned to each block processor is documented here (or, as @facelessuser indicated, you can check the source code).
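To make the idea concrete, here is a minimal sketch of such a block processor. It is not taken from any existing extension. The names FigureBlockProcessor, FigureExtension and LONE_IMAGE_RE are made up for illustration, the regex only handles an inline-style image with a double-quoted title (no reference-style images), and the priority of 12 is just an assumed value chosen to be higher than the 10 that the ParagraphProcessor is registered with in the source as I read it.

```python
import re
import xml.etree.ElementTree as etree

import markdown
from markdown.blockprocessors import BlockProcessor
from markdown.extensions import Extension

# Hypothetical pattern: the whole block is one inline-style image with a
# double-quoted title. Reference-style images are deliberately not handled.
LONE_IMAGE_RE = re.compile(
    r'^!\[(?P<alt>[^\]]*)\]\((?P<src>\S+)\s+"(?P<title>[^"]*)"\)$'
)


class FigureBlockProcessor(BlockProcessor):
    """Claim blocks that consist solely of one image."""

    def test(self, parent, block):
        # Only claim the block if the image stands entirely on its own.
        return bool(LONE_IMAGE_RE.match(block.strip()))

    def run(self, parent, blocks):
        m = LONE_IMAGE_RE.match(blocks.pop(0).strip())
        figure = etree.SubElement(parent, 'figure')
        img = etree.SubElement(figure, 'img')
        img.set('src', m.group('src'))
        img.set('alt', m.group('alt'))
        caption = etree.SubElement(figure, 'figcaption')
        caption.text = m.group('title')


class FigureExtension(Extension):
    def extendMarkdown(self, md):
        # Assumed priority: anything above the paragraph fallback should do.
        md.parser.blockprocessors.register(
            FigureBlockProcessor(md.parser), 'figure', 12
        )


MD = '![cool image](https://mydomain.com/cool_image.jpg "a cool image")'
html = markdown.markdown(MD, extensions=[FigureExtension()])
print(html)
```

Because test() only matches when the image is the entire block, an image that appears in the middle of a sentence is never claimed and still falls through to the ParagraphProcessor and the normal inline handling.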

@waylan @facelessuser Thanks. I'm working on it right now.

One problem I've already run into is if I want to support reference-style block images.

Inline references have no problem because all the blocks in the entire document have been processed before inlines start getting processed. Therefore, even if a bunch of references are at the bottom of the Markdown document, they are all populated in md.references and ready to go.

Even if I put the priority of a block processor lower than the ReferenceProcessor it doesn't help. The images I'm trying to process as blocks have not yet had their references processed, as those references are at the bottom of the Markdown document. This means that it's only going to work for non-reference style images, or if documents happen to have the references above the images, which I think is a rather strange thing to do.

Because references are sort of a special case, is there some way we can scan the entire document for references at the very beginning before any other processing? That way they are ready and populated so that any other processor can refer to them.

waylan commented

> Because references are sort of a special case, is there some way we can scan the entire document for references at the very beginning before any other processing?

Yes, you can use a preprocessor. In fact, Markdown used to do that way back.
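A rough sketch of what such a preprocessor might look like follows. This is an illustration under assumptions, not how the library itself handles references: ReferenceScanner, ReferenceScanExtension and REF_DEF_RE are hypothetical names, the pattern only recognizes single-line definitions with an optional double-quoted title, and the priority of 25 is an arbitrary choice. It also does not skip definitions that happen to sit inside code blocks.

```python
import re

import markdown
from markdown.extensions import Extension
from markdown.preprocessors import Preprocessor

# Simplified, hypothetical pattern: one definition per line with an optional
# double-quoted title. The built-in reference handling accepts more forms.
REF_DEF_RE = re.compile(r'^ {0,3}\[([^\]]+)\]:\s*(\S+)(?:\s+"([^"]*)")?\s*$')


class ReferenceScanner(Preprocessor):
    """Record reference definitions before any block processor runs."""

    def run(self, lines):
        for line in lines:
            m = REF_DEF_RE.match(line)
            if m:
                ref_id = m.group(1).strip().lower()
                self.md.references[ref_id] = (m.group(2), m.group(3))
        # Lines are returned unchanged, so the normal reference handling
        # still sees (and strips) the definitions later on.
        return lines


class ReferenceScanExtension(Extension):
    def extendMarkdown(self, md):
        md.preprocessors.register(ReferenceScanner(md), 'reference_scan', 25)


SOURCE = '''![cool image][cool]

[cool]: https://mydomain.com/cool_image.jpg "a cool image"'''

md = markdown.Markdown(extensions=[ReferenceScanExtension()])
# Run only the scanner here to show what it collects on its own.
md.preprocessors['reference_scan'].run(SOURCE.split('\n'))
print(md.references)
```

With something like this in place, a block processor could look up md.references for a reference-style image even when the definition appears at the bottom of the document.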

Although, another possibility is that if you are exclusively using <figure> in your output, then you could create the <figure> tag and leave the image as Markdown for later processing by the inline processors. For example, your block processor could create this:

<figure>
![cool image](https://mydomain.com/cool_image.jpg "a cool image")
<figcaption>a cool image</figcaption>
</figure>

Well actually, I suppose the figcaption would need to also be dealt with later as a reference style image would have the caption defined in the reference. So maybe this then:

<figure>
![cool image](https://mydomain.com/cool_image.jpg "a cool image")
</figure>

But then the issue is that you need to render the image differently (include or exclude the caption) depending on what the parent is (figure or anything else), and there is no way to get the parent from within an inline processor. However, you could perhaps use the ANCESTOR_EXCLUDES attribute to skip the inline processor. I would have two inline processors. The first one is a replacement for the default and is an exact copy of it, with the one difference being that it has ANCESTOR_EXCLUDES set to include 'figure'. The second inline processor would not have ANCESTOR_EXCLUDES set and would insert both the img and the figcaption. So long as the first one is run first, the second will only ever see images which are in figure elements.

I am closing this as there is no actionable item here. If you have any additional support questions about this issue, feel free to add an additional comment. We can continue to have a discussion in the closed issue.