webplatform/mediawiki-conversion

Clean up code sample generated by MediaWiki SyntaxHighlight GeSHi extension to get raw code

renoirb opened this issue · 5 comments

A code sample written with <syntaxHighlight> in wiki content, such as:

<syntaxHighlight>
<div class="container">
    <div class="box bottom">This box is at the bottom with z-index set to auto.</div>
    <div class="box middle">This box is in the middle with z-index set to auto.</div>
    <div class="box top">This box is at the top with z-index set to auto.</div>
</div>
</syntaxHighlight>

becomes this in the generated HTML:

<div>
 <p><span class="language">HTML</span>
 </p>
 <pre>
<div dir="ltr" class="mw-geshi mw-code mw-content-ltr"><div class="html5 source-html5"><pre class="de1"><span class="sc2">&lt;<span class="kw2">div</span> <span class="kw3">class</span><span class="sy0">=</span><span class="st0">&quot;container&quot;</span>&gt;</span>
    <span class="sc2">&lt;<span class="kw2">div</span> <span class="kw3">class</span><span class="sy0">=</span><span class="st0">&quot;box top&quot;</span>&gt;</span>This box is at the top with z-index set to 30.<span class="sc2">&lt;<span class="sy0">/</span><span class="kw2">div</span>&gt;</span>
    <span class="sc2">&lt;<span class="kw2">div</span> <span class="kw3">class</span><span class="sy0">=</span><span class="st0">&quot;box middle-level-one&quot;</span>&gt;</span>This box is in the middle level 1 with z-index set to 20.<span class="sc2">&lt;<span class="sy0">/</span><span class="kw2">div</span>&gt;</span>
    <span class="sc2">&lt;<span class="kw2">div</span> <span class="kw3">class</span><span class="sy0">=</span><span class="st0">&quot;box middle-level-two&quot;</span>&gt;</span>This box is in at middle level 2 with z-index set to 20.<span class="sc2">&lt;<span class="sy0">/</span><span class="kw2">div</span>&gt;</span>
    <span class="sc2">&lt;<span class="kw2">div</span> <span class="kw3">class</span><span class="sy0">=</span><span class="st0">&quot;box bottom&quot;</span>&gt;</span>This box is at the bottom with z-index set to 10.<span class="sc2">&lt;<span class="sy0">/</span><span class="kw2">div</span>&gt;</span>
<span class="sc2">&lt;<span class="sy0">/</span><span class="kw2">div</span>&gt;</span></pre></div></div>

This markup makes code samples hard to work with in a static site.

The desired output for a static site generator, so that a syntax highlighter works out of the box, is:

<pre class="language-html5" data-lang="html5">
<div class="container">
    <div class="box bottom">This box is at the bottom with z-index set to auto.</div>
    <div class="box middle">This box is in the middle with z-index set to auto.</div>
    <div class="box top">This box is at the top with z-index set to auto.</div>
</div>
</pre>
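
As an illustration of what "getting raw code" means here, the following is a minimal Python sketch that post-processes an already exported GeSHi fragment. It is not the approach taken in this issue (which patches the extension instead), and the function name `geshi_to_raw` is hypothetical. It assumes the fragment contains the `<pre class="de1">` wrapper shown above and that all highlighting markup is made of `<span>` tags.

```python
import html
import re

# Inner block emitted by GeSHi, as seen in the generated HTML above.
_GESHI_PRE = re.compile(r'<pre class="de1">(.*?)</pre>', re.DOTALL)
# Highlighting markup: opening and closing <span> tags only (assumption).
_SPAN_TAG = re.compile(r"</?span\b[^>]*>")


def geshi_to_raw(fragment: str) -> str:
    """Return the raw code contained in a GeSHi-highlighted HTML fragment."""
    match = _GESHI_PRE.search(fragment)
    if match is None:
        raise ValueError('no <pre class="de1"> block found')
    # Drop the highlighting spans, then decode &lt;, &quot;, etc.
    without_spans = _SPAN_TAG.sub("", match.group(1))
    return html.unescape(without_spans)
```

The returned string would still need to be escaped again before being placed inside a `<pre>` element of an HTML page.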

Solution path: change the MediaWiki SyntaxHighlight GeSHi extension with this patch:

From c602156d811f714631670a6a45a66e3848716571 Mon Sep 17 00:00:00 2001
From: Renoir Boulanger <renoir@w3.org>
Date: Fri, 7 Aug 2015 21:04:46 -0400
Subject: [PATCH] Superseed GeSHi to return same as what the rest does

---
 mediawiki/extensions/SyntaxHighlight_GeSHi/SyntaxHighlight_GeSHi.class.php | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mediawiki/extensions/SyntaxHighlight_GeSHi/SyntaxHighlight_GeSHi.class.php b/mediawiki/extensions/SyntaxHighlight_GeSHi/SyntaxHighlight_GeSHi.class.php
index ddaea80..a3589d9 100644
--- a/mediawiki/extensions/SyntaxHighlight_GeSHi/SyntaxHighlight_GeSHi.class.php
+++ b/mediawiki/extensions/SyntaxHighlight_GeSHi/SyntaxHighlight_GeSHi.class.php
@@ -57,6 +57,9 @@ class SyntaxHighlight_GeSHi {
            }
        }
        $lang = strtolower( $lang );
+
+       return sprintf("\n<pre class=\"language-%s\" data-lang=\"%s\">\n%s\n</pre>\n", $lang, $lang, $text);
+
        if( !preg_match( '/^[a-z_0-9-]*$/', $lang ) ) {
            $error = self::formatLanguageError( $text );
            return $error;
--

Find the pages in the exported content whose exported HTML still contains mw-geshi, and use them to create a data/missed.yml file for another mediawiki:run 3 pass.

grep -rli mw-geshi out/content > data/missed-geshi.yml

In vim, I then sort and format them:

:sort
:%s/\.md$//g
:%s/\/index$//g
:%s/^/  - /g

Then prepend at the beginning of the file:

missed:

The end result looks like:

missed:
  - apis/media_source_extensions/MediaSource/addSourceBuffer
  - apis/media_source_extensions/MediaSource/appendBuffer
  - apis/vibration
  - WPD/Annotations
  - WPD/Browser_Testing/QuirksMode
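
The vim steps above could also be scripted. The following is a rough Python sketch, not part of the conversion tool: the function name `build_missed_yaml` is hypothetical, and stripping the `out/content/` prefix is an assumption based on the end result shown above, since the vim commands do not do it.

```python
def build_missed_yaml(paths, root="out/content/"):
    """Turn file paths (one per grep output line) into a 'missed:' YAML list."""
    entries = set()
    for path in paths:
        path = path.strip()
        if not path:
            continue
        # Assumption: entries are relative to the content root.
        if path.startswith(root):
            path = path[len(root):]
        # Same clean-up as :%s/\.md$//g and :%s/\/index$//g
        if path.endswith(".md"):
            path = path[: -len(".md")]
        if path.endswith("/index"):
            path = path[: -len("/index")]
        entries.add(path)
    # sorted() uses plain code-point order, so uppercase sorts before lowercase.
    lines = ["missed:"] + ["  - " + entry for entry in sorted(entries)]
    return "\n".join(lines) + "\n"
```

It could be fed the lines of the file written by the grep command, for example `build_missed_yaml(open("data/missed-geshi.yml"))`, with the result written back to the missed file.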

Notice that the previous grep command crawls everything in out/content/, which by default is considered to be what is committed in the webplatform/docs repository.

If you cloned webplatform/docs-meta into out/content/Meta/ and webplatform/docs-wpd into out/content/WPD/, you’ll see them too.

REMEMBER that mediawiki:run writes into out/ regardless of what it already contains. If you also want to re-run it for docs-meta or docs-wpd, make sure you have the right export data first, then specify it in the mediawiki:run command.

For example:

mv out out-main
mv out-main/WPD out
app/console mediawiki:run 3 --missed --xml-source=dumps/wpd.xml

If you need to ensure MediaWiki gives you the most recent code, you can send a purge by using mediawiki:refresh-pages. Note that this command doesn't affect the content of the out/ folder.

You can send a refresh without worry by running:

app/console mediawiki:refresh-pages --xml-source=dumps/wpd.xml

The issue here isn't limited to syntaxHighlight and SyntaxHighlight_GeSHi. Any code sample may break during import. Work has to be done during the conversion pass to ensure code stays encoded as HTML entities up until pandoc does the conversion.

Updated patch made to SyntaxHighlight_GeSHi.class.php:

From e7c5677ca0d78601573990ee4c6fbcb734bbc645 Mon Sep 17 00:00:00 2001
From: Renoir Boulanger <hello@renoirboulanger.com>
Date: Fri, 4 Sep 2015 19:17:14 -0400
Subject: [PATCH] webplatform/mediawiki-conversion#19 patch

---
 SyntaxHighlight_GeSHi.class.php | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/SyntaxHighlight_GeSHi.class.php b/SyntaxHighlight_GeSHi.class.php
index ddaea80..d6462f2 100644
--- a/SyntaxHighlight_GeSHi.class.php
+++ b/SyntaxHighlight_GeSHi.class.php
@@ -57,6 +57,12 @@ class SyntaxHighlight_GeSHi {
            }
        }
        $lang = strtolower( $lang );
+
+       # webplatform/mediawiki-conversion#19
+       $lang = str_replace(['markup', 'html5'], 'html', $lang);
+       $lang = str_replace(['javascript', 'script'], 'js', $lang);
+       return sprintf("\n<pre class=\"language-%s\">\n%s\n\n</pre>\n", $lang, htmlentities($text));
+
        if( !preg_match( '/^[a-z_0-9-]*$/', $lang ) ) {
            $error = self::formatLanguageError( $text );
            return $error;
--
2.4.2

Another iteration of the patch.

This time, it covers the following issues:

  1. It no longer breaks code blocks whose content isn't ASCII (e.g. Chinese text in comments). Previously the full code block would be removed.
  2. It doesn't escape content twice when escaping.
  3. It allows using the MediaWiki parser tag to escape, not just to syntax highlight. This is useful when a transcluded wiki template may contain code that isn't always escaped.

Patch

From 363a42c5d3314445e0f35713cc421f767d3f4a82 Mon Sep 17 00:00:00 2001
From: Renoir Boulanger <renoir@w3.org>
Date: Wed, 16 Sep 2015 13:04:59 -0400
Subject: [PATCH] Required change to solve webplatform/mediawiki-conversion#19

---
 SyntaxHighlight_GeSHi.class.php | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/SyntaxHighlight_GeSHi.class.php b/SyntaxHighlight_GeSHi.class.php
index d179e77..d7ede30 100644
--- a/SyntaxHighlight_GeSHi.class.php
+++ b/SyntaxHighlight_GeSHi.class.php
@@ -59,6 +59,14 @@ class SyntaxHighlight_GeSHi {
                        }
                }
                $lang = strtolower( $lang );
+
+               # RBx webplatform/mediawiki-conversion#19
+               $lang = str_replace(['markup', 'html5'], 'html', $lang);
+               $lang = str_replace(['javascript', 'script'], 'js', $lang);
+               $escaped = htmlspecialchars($text, ENT_COMPAT|ENT_HTML401, ini_get("default_charset"), false);
+               return sprintf("\n<pre class=\"language-%s\">\n%s\n\n</pre>\n", $lang, $escaped);
+               # /RBx
+
                if( !preg_match( '/^[a-z_0-9-]*$/', $lang ) ) {
                        $error = self::formatLanguageError( $text );
                        wfProfileOut( __METHOD__ );
--
1.9.1
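
To make the behaviour of this last patch easier to follow outside of PHP, here is a minimal Python sketch of the same two steps: normalising the language name and escaping without double-encoding. It only approximates `htmlspecialchars()` called with `ENT_COMPAT` and `double_encode` set to false. The names `escape_once` and `render_code_block` are hypothetical. Unlike the `str_replace()` calls in the patch, which replace substrings (so a value such as `typescript` would be rewritten as well), the sketch uses exact-match aliases.

```python
import re

# Exact-match aliases following the intent of the patch.
LANG_ALIASES = {
    "markup": "html",
    "html5": "html",
    "javascript": "js",
    "script": "js",
}

# An ampersand that does not already start an entity (&amp; &#39; &#x27; ...).
_BARE_AMP = re.compile(
    r"&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#[xX][0-9A-Fa-f]+;)"
)


def escape_once(text: str) -> str:
    """Escape &, <, > and double quotes, leaving existing entities untouched."""
    text = _BARE_AMP.sub("&amp;", text)
    return (
        text.replace("<", "&lt;")
        .replace(">", "&gt;")
        .replace('"', "&quot;")
    )


def render_code_block(text: str, lang: str) -> str:
    """Build the <pre class="language-..."> wrapper used by the patch."""
    lang = lang.lower()
    lang = LANG_ALIASES.get(lang, lang)
    return '\n<pre class="language-%s">\n%s\n\n</pre>\n' % (lang, escape_once(text))
```

Because Python strings are Unicode, non-ASCII content such as Chinese comments passes through `escape_once` unchanged.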

Example of syntax escaping from a transcluded template

The Template:Single_Example MediaWiki template for code samples:

<noinclude>
A block for a single example. Automatically wraps in syntax highlighting.

'''If you manually include an inline-example code block, do not use this template; use [[Template:Inline Example]] instead.''' The Examples section in many article types will automatically use this template.
<pre>
{{Single Example
|Code=
|LiveURL=
|Language=
|Description=
}}
</pre>
{{TODO | Use prism.js for syntax highlighting}}
</noinclude><includeonly>
{{{Description

|}}}
{{#ifeq:  {{{Language|Markup}}}  |  Markup  |  {{#set:Language=html}}  }}
<div class="example">
{{#tag:syntaxHighlight
  |{{{Code|}}}
  |lang={{{Language|}}}
}}
{{#if: {{{LiveURL|}}} | [{{{LiveURL|}}} View live example]
|}}
</div>
</includeonly>