char regex .*{0,1} is incorrect (antisamy.xml)

Question

GoogleCodeExporter opened this issue 10 years ago · 1 comment

GoogleCodeExporter commented 10 years ago

Looking at antisamy.xml, SVN revision 137:

<attribute name="char">
    <regexp-list>
        <regexp value=".*{0,1}"/>
    </regexp-list>
</attribute>

I think the intent is to allow zero or one character, as described at
http://www.w3.org/TR/html401/types.html#type-character.
If that's the intent, the regex should be ".{0,1}".
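
To illustrate the proposed pattern, here is a minimal Java sketch that checks values against ".{0,1}" using whole-string matching. The class and method names are hypothetical, and whether AntiSamy applies the policy regex in exactly this way is an assumption; the sketch only shows what the pattern itself accepts.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of AntiSamy.
class CharRegexSketch {
    // Pattern proposed in this issue: zero or one character.
    static final Pattern ZERO_OR_ONE_CHAR = Pattern.compile(".{0,1}");

    // True when the whole value is empty or exactly one character.
    // Assumption: the attribute value must match the pattern in full.
    static boolean isValidChar(String value) {
        return ZERO_OR_ONE_CHAR.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidChar(""));   // true
        System.out.println(isValidChar("."));  // true
        System.out.println(isValidChar("..")); // false
    }
}
```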

To be 100% correct, however, the regex should also allow character
references, including numeric character references such as &#229; or
&#x3072; (see http://www.w3.org/TR/html401/charset.html#h-5.3.1) and
character entity references such as &lt; or &quot; (see
http://www.w3.org/TR/html401/charset.html#h-5.3.2 and
http://www.w3.org/TR/html401/charset.html#entities).
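
One way such a pattern might look is sketched below in Java. This is only an illustration of the idea, not a pattern taken from antisamy.xml: the class name, method name, and the regex itself are assumptions, and the entity branch accepts any name of the form &name; without checking it against the HTML 4.01 entity list.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of AntiSamy.
class CharReferenceRegexSketch {
    // Accepts an empty value, or exactly one of:
    //   a decimal numeric character reference, e.g. &#229;
    //   a hexadecimal numeric character reference, e.g. &#x3072;
    //   a character entity reference, e.g. &lt; (name is not validated)
    //   a single literal character
    static final Pattern CHAR_OR_REFERENCE = Pattern.compile(
        "(?:&#[0-9]+;|&#[xX][0-9a-fA-F]+;|&[A-Za-z][A-Za-z0-9]*;|.)?");

    // Assumption: the attribute value must match the pattern in full.
    static boolean isValidChar(String value) {
        return CHAR_OR_REFERENCE.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidChar("&#229;"));   // true
        System.out.println(isValidChar("&#x3072;")); // true
        System.out.println(isValidChar("&quot;"));   // true
        System.out.println(isValidChar("&quot;a"));  // false
    }
}
```

This sketch only matters if the regex is applied to the raw, undecoded attribute text. If the value is decoded before validation, the simpler ".{0,1}" is enough.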

Original issue reported on code.google.com by danr...@gmail.com on 23 Dec 2009 at 8:41

Answer 1 · 2015-04-22T12:04:52.000Z

At runtime this will be enforced correctly. A character reference such as
"&quot;" will be treated as a single character. I confirmed it with the
following test case:

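// `as` (the AntiSamy instance) and `policy` (the loaded antisamy.xml policy)
// are assumed to be initialized by the surrounding test fixture.

// One literal character: the attribute is kept.
String s = "<td char='.'>test</td>";
CleanResults cr = as.scan(s, policy);
assertTrue(cr.getCleanHTML().indexOf("char") > -1 );

// Two literal characters: the attribute is removed.
s = "<td char='..'>test</td>";
cr = as.scan(s, policy);
assertTrue(cr.getCleanHTML().indexOf("char") == -1 );

// One entity reference counts as a single character: the attribute is kept.
s = "<td char='&quot;'>test</td>";
cr = as.scan(s, policy);
assertTrue(cr.getCleanHTML().indexOf("char") > -1 );

// Entity reference followed by a letter: the attribute is removed.
s = "<td char='&quot;a'>test</td>";
cr = as.scan(s, policy);
assertTrue(cr.getCleanHTML().indexOf("char") == -1 );

// Entity reference followed by an ampersand: the attribute is removed.
s = "<td char='&quot;&'>test</td>";
cr = as.scan(s, policy);
assertTrue(cr.getCleanHTML().indexOf("char") == -1 );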

Original comment by arshan.d...@gmail.com on 8 Mar 2010 at 5:54

Changed state: Invalid