jgm/pandoc

Syntax Highlighting not working in HTML (CSS not being picked up)

tajmone opened this issue · 14 comments

There's something wrong with the generated CSS for syntax highlighting.

With pandoc versions 3.1.13 and 3.2 (under MS Win 10 x64) I can't see the code being colored in the generated HTML document, even though I'm using the default template (standalone document) and have tried specifying different highlighting themes.

When I inspect the source code, the CSS covering code highlighting is there, but when I inspect the individual lines of code in Chrome Inspector it doesn't report any CSS being associated with the various span classes of the code elements, as if the stylesheet isn't picking them up at all. It's not that the rules are being overridden due to specificity issues; they simply are not there.

I'm not a CSS expert, so I can't figure out why Chrome is not (no longer) picking up the syntax highlighting definitions for those spans, but something is definitely wrong since I haven't encountered this problem before.

Anyone else experiencing this?

jgm commented

No, everything is working correctly for me.
Do you want to share the generated HTML file?

> No, everything is working correctly for me.

That's really strange, I carried out quite extensive tests.

Later on, I tweaked the CSS (added a CSS framework and just the syntax highlighting part of the default pandoc stylesheet) and syntax highlighting was working correctly.

> Do you want to share the generated HTML file?

Sure, it's just a sample page to test pandoc markdown against custom templates (except that right now it doesn't use any template).

GitHub doesn't allow attaching HTML files, so I've parked it on a free service (30 days then expires):

https://ufile.io/axn91a9d

jgm commented

Sorry, this service requires me to do various other things to download the file, and I'm not going to do that. Just add a .txt extension to the HTML file and upload it here.

> Sorry, this service requires me to do various other things to download the file, and I'm not going to do that. Just add a .txt extension to the HTML file and upload it here.

I apologize, I didn't realize that.

In the meantime, I realized what the problem is. In my project I was filtering pandoc output through HTML Tidy (forgot about that, I was using an old Rake setup), which strips off the <![CDATA[ from the embedded stylesheet:

  <style>
  <![CDATA[

which might explain why Chrome wasn't picking up the styles at all in the source inspector.

Personally, I used Tidy a lot with pandoc projects, mostly because the HTML generated by pandoc tends to have stray indentation. Tidy is a reliable validator, so I still think that the fact that some CSS style elements don't work after going through Tidy is (or might be) an issue.

What's the reason behind the <![CDATA[ wrapper around the CSS? For some reason Tidy seems to complain about it and strip it away. Ideally, the CSS should also work without the <![CDATA[. It's true that Tidy hasn't been updated in a while and might be lagging behind the latest HTML/CSS developments (e.g. it complains about start-num in lists, which is now back again in official HTML).

I'm attaching two files (with .txt extension added, as suggested):

jgm commented

Pandoc doesn't insert CDATA there. What version of pandoc are you using?
What command line, exactly, are you using to produce this HTML?

jgm commented

Actually it's the reverse of what you report: the version processed with tidy has the CDATA, so that must be inserted by tidy.
Closing.

The Problem: Missing type= attribute

> Actually it's the reverse of what you report:

Sorry about that!

Anyhow, I did some further tests after reading this and found some posts on the topic about why Tidy adds the <![CDATA[ part.

It only does that with JavaScript and CSS if you omit the type= attribute:

<style type="text/css">
<script type="text/javascript">

I've tweaked the source of the default pandoc HTML template accordingly, and now Tidy wraps the <![CDATA[ part within comments (I believe as a hack for backward compatibility with old browsers) and now the CSS is working fine even with Tidy:

  <style type="text/css">
  /*<![CDATA[*/
    html {
      color: #1a1a1a;

I think it would be a good idea to tweak the HTML5 default template to include the type= attribute within the <style> and <script> tags, because Tidy is right (in fact it reported it as a problem it had to fix). See:

Tidy is considered the de facto standard when it comes to HTML linting and validation, and in this case Tidy was right about the complaint since adding the type= attribute is best practice.

jgm commented

Tidy is not right.

type= is needed in HTML4, but not HTML5.
In fact, it is deprecated in HTML5 and "should not be provided":
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/style

tidy is not HTML5 aware; you should use a different tool to validate HTML5.

I'm using HTML Tidy for HTML 5.

jgm commented

I didn't know there was an HTML5 version of tidy.
In any case, see the docs I linked to, which clearly state that type should not be used in this case for HTML5. Maybe you should report a bug to tidy?

jgm commented

OK, here's the clue:
htacg/tidy-html5#660

tidy is seeing this as an XHTML5 file (not unreasonably since we try to produce polyglot HTML). And it's adding CDATA to ensure XML conformance. You can defeat this by using the -ashtml flag with tidy.
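For anyone scripting this step, here is a minimal Python sketch of a tidy call that uses the -ashtml flag suggested above. The helper names tidy_command and run_tidy and the file names are hypothetical, and the extra -indent, -quiet and -o options are my own choice of standard tidy options, not something taken from this thread.

```python
import shutil
import subprocess


def tidy_command(html_path, out_path):
    """Build a tidy invocation that treats the input as HTML rather than XHTML."""
    return [
        "tidy",
        "-ashtml",        # flag suggested in this thread to avoid the XML-style CDATA wrapper
        "-indent",        # re-indent the markup for readability
        "-quiet",         # suppress non-essential output
        "-o", out_path,   # write the result to out_path
        html_path,
    ]


def run_tidy(html_path, out_path):
    """Run tidy and return its exit code (hypothetical helper)."""
    if shutil.which("tidy") is None:
        raise FileNotFoundError("tidy executable not found on PATH")
    # tidy exits with 1 when it only emitted warnings, so check=True is not used here.
    result = subprocess.run(tidy_command(html_path, out_path))
    return result.returncode
```

Calling run_tidy("page.html", "page.tidy.html") would then leave the pandoc output untouched and write the reformatted copy next to it.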

jgm commented

Now, here's the interesting thing. If you add type="text/css", it still adds the CDATA, but it puts it in a CSS comment, and that's why it doesn't screw things up for browsers. Without the type="text/css", it doesn't do this -- I'm not sure why not, because text/css is supposed to be the default in HTML5.
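To make the difference concrete, here is a rough Python sketch that flags a style block whose content starts with a bare CDATA marker, as opposed to one hidden inside a CSS comment. The function name has_bare_cdata and the two sample strings are illustrative only; the closing /*]]>*/ in the second sample is assumed, since only the opening marker is quoted in this thread. The regex is a quick heuristic, not a real HTML parser.

```python
import re

# Capture the raw text of every <style> element (heuristic, not a full HTML parser).
STYLE_RE = re.compile(r"<style\b[^>]*>(.*?)</style>", re.DOTALL | re.IGNORECASE)


def has_bare_cdata(html):
    """Return True if any <style> body starts with an uncommented <![CDATA[ marker."""
    for body in STYLE_RE.findall(html):
        if body.lstrip().startswith("<![CDATA["):
            return True
    return False


# Shape reported for tidy output when <style> has no type attribute.
broken = "<style>\n<![CDATA[\nhtml { color: #1a1a1a; }\n]]>\n</style>"

# Shape reported when type="text/css" is present: the marker sits inside a CSS comment.
safe = (
    '<style type="text/css">\n/*<![CDATA[*/\n'
    "html { color: #1a1a1a; }\n/*]]>*/\n</style>"
)

print(has_bare_cdata(broken))  # True
print(has_bare_cdata(safe))    # False
```

A check like this could run after tidy in a build script to catch the problem before the page is opened in a browser.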

> Without the type="text/css", it doesn't do this -- I'm not sure why not, because text/css is supposed to be the default in HTML5.

I couldn't come up with an answer either, but my best guess is that it's just a hack for backward compatibility with old browsers (e.g. IE7).

Development of the new HTML Tidy version targeting HTML5 (the "new version" being a revamped implementation of the parser) seems to have stalled for a couple of years, but in the meantime the "old" (classic) version was indeed updated to support HTML5, although it hasn't caught up with the HTML5 spec changes of the last couple of years (e.g. the start number in ordered lists, first deprecated then reinstated, which Tidy still considers deprecated). So, the latest Tidy is HTML5 compliant, but lags behind the standard by two years or so (and I don't think there will be any further updates until the new parsing engine is completed).

Although the type= attribute is deprecated, it's still usable (using it is not considered invalid). This is probably one of those "transitional changes" that take time to come into effect in real practice. I noticed that many modern CSS libraries and HTML templates still use it in their HTML code, so I'm assuming that if they cling to it there might be some utility in doing so.

Unfortunately, HTML Tidy won't just reformat the indentation of HTML without altering the contents; it always applies changes. Most default settings enforce clean-up and optimizations of all sorts, and even if you switch all of them off via command line options or a configuration file, Tidy will still enforce some changes which are hardwired into the program. I gather that Tidy was designed mainly as a validation tool for manually written HTML documents, and not really for procedurally generated HTML docs. It's possible to "teach" Tidy about new tags via configurations, which allows some margin to handle custom tags properly since you can instruct Tidy on how to treat them (inline, block, pre, etc.), but I'm not sure this could be used to override the behavior of known tags.

I couldn't find a cross-platform single-binary command line tool for reformatting HTML (other than NPM packages, which would add unwanted dependencies to my projects). I think that since Tidy has been the standard tool for the job for so many years, most people simply didn't invest energy in a competitor project.

The main reason why I added Tidy to my project is to improve readability of the generated HTML, which I often need to inspect to check if my custom Lua filters work as expected. I should probably remove Tidy from my toolchain, and find a way to pretty indent-reformat the HTML code directly in the editor instead, at inspection time.

I still think that this issue is worth taking into consideration, if only because HTML Tidy remains the number one choice when it comes to validating/reformatting HTML pages, so chances are that pandoc users who'd like to polish their final HTML docs might end up using Tidy anyway (like I did). The way Tidy breaks the default template stylesheet is quite subtle, and might go unnoticed and/or take considerable time to discover, since it can be quite confusing.

jgm commented

type= has been optional in HTML5 for at least 10 years...
Just use the -ashtml flag with tidy as I suggested.