snarfed/granary

newlines in <pre> tags shouldn't get removed

sknebel opened this issue · 7 comments

Just noticed, while reading https://aaronparecki.com/2018/02/07/7/indieauth in my feed reader (@aaronpk uses granary to provide an Atom feed), that the contents of <pre> tags also get their newlines stripped, so the code examples lose their line breaks.

Given that granary doesn't seem to need to parse the HTML anywhere, I totally get it if this is WON'T FIX. And since the microformats parser returns the newlines as they are, the parser also seems to be the wrong place to handle this(?)

(Ref #80 for why newlines are stripped)

After further reading: if the feed generation at the end of the process could know whether the source text is HTML or plain text, this could be solved by keeping HTML unmodified. But the ActivityStreams 1 format does not seem to support keeping that distinction? Changing this seems more realistic, but it is still quite a bit of effort.

'content': get_html(prop.get('content')),

Just want to point out that I may have some problems with my own HTML/newline handling. Right now my HTML has newlines but no <br> tags, and I use CSS to get the whitespace to show up right. That means any consumers treating it as HTML will not see the newlines, since literal newlines in HTML are not significant. I think I'm going to need to update how my site handles newlines in general.

thanks for filing @sknebel, and for the in depth sleuthing! whee, whitespace handling. always entertaining. i'll take a look soon.

for my own notes: @aaronpk may be right above about his HTML in general, but for this specific case, the offending content is indeed inside <pre>s, which granary could still theoretically detect and preserve.
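
A rough sketch of what "detect and preserve" could look like, for illustration only. This is not granary's code: the function name `strip_newlines_outside_pre` is hypothetical, and the regex approach is an assumption that ignores nested `<pre>` elements and other edge cases a real HTML parser would handle.

```python
import re

# Matches a whole <pre>...</pre> element, case-insensitively, across lines.
# Assumption: <pre> elements are not nested.
PRE_RE = re.compile(r'(<pre\b[^>]*>.*?</pre>)', re.IGNORECASE | re.DOTALL)


def strip_newlines_outside_pre(html):
    """Collapse newlines to single spaces everywhere except inside <pre>."""
    # Splitting on a pattern with one capturing group puts the matched
    # <pre> blocks at the odd indexes of the result.
    parts = PRE_RE.split(html)
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 1:
            out.append(part)  # <pre> block: keep verbatim
        else:
            out.append(re.sub(r'\s*\n\s*', ' ', part))
    return ''.join(out)
```

With this, `'<p>a\nb</p>\n<pre>x\n  y</pre>'` keeps the newline and indentation inside the `<pre>` while the surrounding newlines become spaces.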

Some more thoughts, both assuming keeping AS1 as the central format:

  1. Since AS1 generally assumes HTML for content, plain text properties could be turned to HTML on the input conversion in a way that transparently converts back on text-only outputs.

  2. Several Python templating libraries have a special string type for HTML (e.g. markupsafe.Markup, which Jinja2 has also exposed as jinja2.Markup) that does not get escaped on output, so the object could know whether it contains HTML or not.
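
To make the second idea concrete, here is a minimal standard-library sketch of a marker string type. `HTMLText` and `render_for_feed` are hypothetical names, not part of granary or MarkupSafe, and the `<br />` conversion for plain text is an assumption about what a feed output would want.

```python
import html


class HTMLText(str):
    """Hypothetical marker type: a str that is known to contain HTML."""


def render_for_feed(content):
    """Return HTML for a feed entry, given HTMLText or plain text."""
    if isinstance(content, HTMLText):
        # Already HTML: pass through untouched, newlines included.
        return str(content)
    # Plain text: escape special characters and make line breaks explicit.
    return html.escape(content).replace('\n', '<br />\n')
```

One caveat with a bare `str` subclass: operations such as concatenation or `.replace()` return a plain `str` and silently drop the marker. `markupsafe.Markup` overrides those methods to avoid this, which is an argument for using it rather than a hand-rolled type.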

i don't regret tackling this just yet...but i'm sure i will eventually. 🤣

thanks again for the ideas @sknebel. i handled this by adding a custom content_is_html property to AS1 when we generate it from HTML, and then use that to determine whether to strip newlines. we'll see what else breaks.
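
For readers skimming the thread, the flag-based approach might look roughly like the sketch below. Only the `content_is_html` property name comes from the comment above; the function name `prepare_content` and the exact newline-stripping rule are assumptions, not granary's actual implementation.

```python
import re


def prepare_content(obj):
    """Return an AS1 object's content, stripping newlines only from plain text."""
    content = obj.get('content', '')
    if obj.get('content_is_html'):
        # Content was generated from HTML: leave it unmodified so that
        # newlines inside <pre> (and anywhere else) survive.
        return content
    # Assumed stripping rule: collapse each newline, plus any whitespace
    # around it, into a single space.
    return re.sub(r'\s*\n\s*', ' ', content)
```

For example, `prepare_content({'content': '<pre>a\nb</pre>', 'content_is_html': True})` returns the content unchanged, while `prepare_content({'content': 'a\nb'})` returns `'a b'`.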