char regex .*{0,1} is incorrect (antisamy.xml)
GoogleCodeExporter opened this issue · 1 comments
GoogleCodeExporter commented
Looking at antisamy.xml, SVN revision 137:
<attribute name="char">
    <regexp-list>
        <regexp value=".*{0,1}"/>
    </regexp-list>
</attribute>
I think the intent is to allow zero or one character, as described at
http://www.w3.org/TR/html401/types.html#type-character.
If that's the intent, the regex should be ".{0,1}".
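As a quick check of the proposed pattern, here is a minimal Java sketch. It assumes the policy regex is applied as a full match against the attribute value, which may not be how AntiSamy evaluates it. The class and method names are hypothetical and not part of AntiSamy.

```java
import java.util.regex.Pattern;

class CharRegexCheck {
    // Pattern proposed in this report: zero or one character.
    private static final Pattern ZERO_OR_ONE = Pattern.compile(".{0,1}");

    // Hypothetical helper: true only if the whole value matches the pattern.
    static boolean isAllowed(String value) {
        return ZERO_OR_ONE.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllowed(""));   // true: zero characters
        System.out.println(isAllowed("."));  // true: exactly one character
        System.out.println(isAllowed("..")); // false: two characters
    }
}
```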
To be 100% correct, however, the regex should also allow character
references, including numeric character references such as &#229; (å) or
&#x3072; (ひ) (see http://www.w3.org/TR/html401/charset.html#h-5.3.1) and
character entity references such as &lt; or &quot; (see
http://www.w3.org/TR/html401/charset.html#h-5.3.2 and
http://www.w3.org/TR/html401/charset.html#entities).
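One way such a pattern might look is sketched below in Java. This is an illustration under stated assumptions, not a proposed patch: it assumes the regex sees the raw, undecoded attribute value, it accepts any syntactically plausible entity name rather than checking it against the HTML 4.01 entity list, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class CharOrReferenceCheck {
    // Zero or one of: a character reference, or a single literal character.
    private static final Pattern CHAR_OR_REFERENCE = Pattern.compile(
        "(?:&#[0-9]+;"                 // decimal numeric reference, e.g. &#229;
        + "|&#[xX][0-9a-fA-F]+;"       // hexadecimal numeric reference, e.g. &#x3072;
        + "|&[A-Za-z][A-Za-z0-9]*;"    // character entity reference, e.g. &lt;
        + "|.)?");                     // any single literal character

    // Hypothetical helper: true only if the whole value matches the pattern.
    static boolean isAllowed(String value) {
        return CHAR_OR_REFERENCE.matcher(value).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"", ".", "&#229;", "&#x3072;", "&lt;", "&quot;", "..", "&quot;a"};
        for (String sample : samples) {
            System.out.println("'" + sample + "' allowed: " + isAllowed(sample));
        }
    }
}
```

With this sketch, the empty string, a single character, and a single reference are accepted, while `..` and `&quot;a` are rejected.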
Original issue reported on code.google.com by danr...@gmail.com
on 23 Dec 2009 at 8:41
GoogleCodeExporter commented
At runtime this will be enforced correctly. The "&quot;" will be treated as a single
character. I confirmed it with the following test case:
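// One literal character: the char attribute is kept.
String s = "<td char='.'>test</td>";
CleanResults cr = as.scan(s, policy);
assertTrue(cr.getCleanHTML().indexOf("char") > -1 );
// Two literal characters: the char attribute is removed.
s = "<td char='..'>test</td>";
cr = as.scan(s, policy);
assertTrue(cr.getCleanHTML().indexOf("char") == -1 );
// One entity reference counts as a single character: kept.
s = "<td char='&quot;'>test</td>";
cr = as.scan(s, policy);
assertTrue(cr.getCleanHTML().indexOf("char") > -1 );
// Entity reference followed by a literal character: removed.
s = "<td char='&quot;a'>test</td>";
cr = as.scan(s, policy);
assertTrue(cr.getCleanHTML().indexOf("char") == -1 );
// Two entity references: removed.
s = "<td char='&quot;&amp;'>test</td>";
cr = as.scan(s, policy);
assertTrue(cr.getCleanHTML().indexOf("char") == -1 );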
Original comment by arshan.d...@gmail.com
on 8 Mar 2010 at 5:54
- Changed state: Invalid