erikrose/mediawiki-parser

Allowed HTML tags

Closed this issue · 2 comments

We need to support HTML tags like table, i, b, etc. The list of allowed tags should be modifiable by the caller. (Worst case, we can use the extension-by-SortedDict technique and provide a helper procedure that takes tag names and spits out PEG syntax.) Misbalanced tags should be recovered in some sane way, as MediaWiki currently does.
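As a rough illustration of the helper procedure idea, here is a minimal sketch that turns a list of tag names into a single ordered-choice rule. The function name `tags_to_peg`, the rule name `allowed_tag`, and the output notation are all assumptions made for illustration; the output is generic PEG-style text and would likely need adapting to pijnu's actual grammar syntax.

```python
import re


def tags_to_peg(tag_names, rule_name="allowed_tag"):
    """Build a PEG-style ordered-choice rule matching any of the given tag names.

    Hypothetical helper: the emitted notation is generic PEG, not
    necessarily what pijnu expects.
    """
    names = {name.lower() for name in tag_names}
    if not names:
        raise ValueError("at least one tag name is required")
    for name in names:
        # Reject anything that is not a plain alphanumeric tag name, so the
        # generated grammar text cannot be broken by quotes or spaces.
        if not re.fullmatch(r"[a-z][a-z0-9]*", name):
            raise ValueError("invalid tag name: %r" % name)
    # PEG choice is ordered, so put longer names first: otherwise "b"
    # would match the start of "blockquote" and shadow it.
    ordered = sorted(names, key=lambda n: (-len(n), n))
    alternatives = " / ".join('"%s"' % n for n in ordered)
    return "%s : %s" % (rule_name, alternatives)


print(tags_to_peg(["table", "i", "b"]))
```

Sorting longest-first matters only because PEG alternatives are tried in order; a grammar that checks for a word boundary after the tag name would not need it.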

Rating this as low-priority, as we might be able to just use the bleach library for this. We'll see.

Extension-by-SortedDict is not going to work until we cure pijnu of its code generation habits (erikrose/pijnu#2), but that's a pretty last-ditch way to do it anyway. A better way, supported by my latest commit to pijnu (erikrose/pijnu@494f53d), might be to capture all HTML-like tags in the parser and then pass in a custom-built callback which either strips the tags or retains them (perhaps to balance in a later pass).
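To make the callback idea concrete, here is a minimal standalone sketch. It does not use pijnu at all: a regular expression stands in for the parser rule that captures HTML-like tags, and the names `TAG_RE`, `make_tag_callback`, and `filter_tags` are hypothetical. The caller-supplied set of allowed tags decides whether each captured tag is retained or stripped; balancing is left to a later pass.

```python
import re

# Loosely matches HTML-like opening and closing tags such as <b>, </table>
# or <br />. This stands in for the parser rule that captures tags; it is
# an assumption for illustration, not the project's grammar.
TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)[^<>]*>")


def make_tag_callback(allowed_tags):
    """Return a callback that retains allowed tags and strips all others."""
    allowed = {tag.lower() for tag in allowed_tags}

    def on_tag(match):
        if match.group(1).lower() in allowed:
            return match.group(0)  # retain the tag exactly as written
        return ""  # strip the tag but keep the text around it

    return on_tag


def filter_tags(text, allowed_tags):
    """Apply the callback to every HTML-like tag found in text."""
    return TAG_RE.sub(make_tag_callback(allowed_tags), text)


print(filter_tags("<b>bold</b> <script>x</script>", ["b"]))
```

Because the allowed list is passed in when the callback is built, the caller can change it without touching the grammar, which is the point of this approach over generating PEG syntax.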

This is implemented and tested. Closing.