xerofun/owaspantisamy

Ability to encode unknown tags without balancing them.

Closed this issue · 8 comments

AntiSamy has the "onUnknownTag" directive, which can be set to encode
whatever is not recognized. However, CyberNeko will balance the tag before
it is given to AntiSamy.

For example, cleaning the string:
"<b>hello</b> <world> !"
Will return:
<b>hello</b> &lt;world&gt; !&lt;/world&gt;

However I'd like the ability to recognize that "<world>" is not an html tag
and just encode it instead of removing it and without having it balanced.

I'd like to get:
"<b>hello</b> &lt;world&gt; !"

A use case would be a text input where the user enters plain text but the
output is rendered into HTML. We would like to display literally whatever
the user entered (for instance, "<world>") and let the browser pick up markup
that is considered safe (for instance, "<b>").

This is easy to accomplish. Before sending the html string to the
DOMFragmentParser, we can encode the open bracket for any unrecognized tag.
An unrecognized tag would be a tag that was not explicitly defined in the
policy.
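
A minimal sketch of that pre-processing step, for illustration only (this is
not the attached patch). The class name UnknownTagPreEncoder, the method
encodeUnknownTags, and the allowedTags set are all hypothetical; in practice
the set would be built from the tag names defined in the policy. It only
encodes the opening bracket, as described above, and does not try to handle
"<" inside comments or attribute values.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnknownTagPreEncoder {

    // Matches "<" or "</" followed by something that looks like a tag name.
    private static final Pattern TAG_START =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)");

    /**
     * Replaces the opening bracket of every tag whose name is not in
     * allowedTags with "&lt;", so the HTML parser sees it as plain text
     * and has nothing to balance. allowedTags is assumed to hold
     * lowercase tag names taken from the policy (hypothetical input).
     */
    public static String encodeUnknownTags(String html, Set<String> allowedTags) {
        Matcher m = TAG_START.matcher(html);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            String name = m.group(2).toLowerCase(Locale.ROOT);
            if (!allowedTags.contains(name)) {
                // Copy everything up to the bracket, then emit the entity.
                out.append(html, last, m.start()).append("&lt;");
                last = m.start() + 1; // skip only the "<" itself
            }
        }
        out.append(html, last, html.length());
        return out.toString();
    }
}
```

With only "b" in the allowed set, the string "<b>hello</b> <world> !" becomes
"<b>hello</b> &lt;world> !", which can then be passed on to the parser without
"<world>" being treated as an element.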

I attach a patch file with the suggested changes.

Original issue reported on code.google.com by carlos.a...@gmail.com on 26 Nov 2008 at 6:24

I'm not necessarily against this patch, but it does put more responsibility
on AntiSamy users. So right now, if a user inputs the following text:

Hey mate, how's your leg? <g>

The <g>, obviously meant to be a grin abbreviation, will be balanced and then
filtered - so, poof, it's gone. But with the patch, it'll stay. So if an
AntiSamy user doesn't have <script> in their policy file, your suggestion is
that the <script> tag appear encoded? While safe, it is kind of against the
idea that the policy only dictates what you want, and not what you don't want
(whitelist versus blacklist). Anybody care to chime in here?

Arshan

Original comment by arshan.d...@gmail.com on 5 Dec 2008 at 6:28

  • Added labels: Priority-Low, Type-Enhancement
  • Removed labels: Priority-Medium, Type-Defect

I agree with Arshan that white-listing would be the preferred way. Maybe we
could have the option of adding action="encode" to pre-defined tags like <g>
and <world> in the policy files?

Original comment by phlogist...@gmail.com on 29 Dec 2008 at 11:43

OK. This feature will be added by the next major release.

Original comment by arshan.d...@gmail.com on 21 Jan 2009 at 5:00

  • Changed state: Accepted

Original comment by arshan.d...@gmail.com on 21 Jan 2009 at 5:00

Original comment by arshan.d...@gmail.com on 17 Mar 2009 at 2:23

  • Changed state: Fixed

Original comment by arshan.d...@gmail.com on 3 Aug 2009 at 2:44

  • Changed state: Verified

In which version was it fixed? How should it work right now?

I still have this issue with custom tags being balanced when onUnknownTag is
set to "encode".

Original comment by jacek.ja...@gmail.com on 29 Jul 2014 at 9:50

I'm using 1.4.5 and still seeing the following issues when onUnknownTag is
set to "encode":
1. Custom tags are being balanced.
2. Custom tags with @& or other "technical" invalid characters are being
balanced and modified. For example:
input: <test@gmail.com>my test
output: <test>my test</test>

input: <test> my test
output: <test>my test</test>

Original comment by ranimat...@gmail.com on 21 Oct 2014 at 5:04