canonicalize sees entity which isn't there
bardware opened this issue · 7 comments
Describe the bug
I call canonicalize
on a string and pass the resulting string to encodeForJavaScript
. Characters show up, that are not supposed to be there
Specify what ESAPI version(s) you are experiencing this bug in
I tried various versions including the latest 2.5.2.
To Reproduce
I have this Java program
import org.owasp.esapi.ESAPI;
import org.owasp.esapi.Encoder;
public class BardwareEsapi {
public static void main(String[] args) {
System.out.println( ESAPI.encoder().encodeForJavaScript( ESAPI.encoder().canonicalize( "fit&gesund" ) ) );
}
}
I compile it like javac.exe -cp esapi-2.5.2.0.jar BardwareEsapi.java
and run it like java.exe -cp .;esapi-2.5.2.0.jar;log4j-1.2.17.jar BardwareEsapi
Expected behavior
On the commandline I want to read the input string fit&gesund
. However, I see fit\u2265sund
. \u2265
stands for Greater-Than Or Equal
.
I guess, the substring &ge
is interpreted as the HTML entity &ge
which is not there as there is no semicolon.
Platform environment (please complete the following information):
- Windows; 64 bit, German
- JAVA 17
Additional context
I originally noticed this issue on Adobe Coldfusion where I call encodeForJavaScript( "fit&gesund", true )
and get the corrupted string. I expected this to be an issue, because ColdFusion comes with an elder version of your library but it seems this issue exists in the most recent version, too.
@bardware - I don't think this is a bug. I think you are using it incorrectly. You really should NOT be calling Encoder.canonicalize()
here. It's important to call canonicalize
before you doing validation, but not when you are doing output encoding.
If I pass the input of "fit&gesund" to Encoder.canonicalize()
, it will return the string "fit≥sund", which is where your "≥" arises from. Calling Encoder.encodeForJavaScript() on that then returns "fit\u2265sund". However, if you call Encoder.encodeForJavaScript()
directly on the string "fit&gesund", it will return the string "fit\x26gesund", which makes sense as '&' is 0x26 in ASCII and UTF-8.
If you agree, then please close this ticket. If you disagree, then please explain your rationale for first calling Encoder.canonicalize
before doing output encoding.
Kevin did most of the heavy lifting.
Probably part of the problem is that you GOT the value using Validator.getValidInput()
which IIRC runs a canonicalization on it.
I've long felt that the semantics of that part of the API were terrible. Because it encourages storage of the canonicalized input.
Proper practice would be to canonicalize, validate, and then discard the validated version and store the original input (because canonicalization is destructive.)
I think we should update the documentation of the getValid
methods to make the intention more clear.
After 14yrs this feels like a "code smell" like "getValidSafeHTML()
"
At any rate, I don't think this is a bug because browsers will render HTML entities without the semicolon (last tested about three months back) and the safe default because of that would be to reduce them whenever discovered. Browser Vendors are the ones forcing that.
If you think THAT is an ESAPI issue, ESAPI is BSD licensed and you can write your own HTMLEntityCodec
that strictly only detects with the semicolons. However if that causes security issues for you, that's on you as you're deviating from a safe default.
Thank you for your replies.
I got the original behaviour when I used ColdFusion
The EncodeForHTML function takes two arguments. The string and a value that says if canonicalization happens before encoding
.
I guess this is supposed to ensure entities are not encoded multiple times. That's how I expect it to work.
I guess it would be okay to do a canonicalization check before encoding I guess and if you see that the canonicalized string and the original input different, just reject it outright (as an attempted attack, say), but at least for the ESAPI for Java output encoders, feeding the canonicalized output into the encoders as input is going to cause problems like I originally noted. And output encoding is safe (as long as you select the encoder for the appropriate context), even if the string is not in canonical form. Maybe ColdFusion did something differently.
As far as I know, that's how the ESAPI for Java output encoders have always worked, at least going back as far as version 2.0, which is when I first got involved. You can look at 'git blame' on DefaultEncoder.java if you want, but the only recent change was an addition to support encodeForJSON
and a bit before that, @xeno6696 added changes that (I think) had to do with encodeForURL
(or perhaps that was only addressing the Validator; it's been a while). I think it was related to GitHub Issue #300 though. Matt can expound or correct me if I'm wrong or forgetting something. Other than that, I think there's just been a few bug fixes, but nothing else major.
@bardware wrote:
I got the original behaviour when I used ColdFusion
The EncodeForHTML function takes two arguments. The string and a value that says ifcanonicalization happens before encoding
.
I guess this is supposed to ensure entities are not encoded multiple times. That's how I expect it to work.
@planetlevel can correct me if I am wrong, but I believe that technically ESAPI for Cold Fusion has never been an OWASP project and it is in fact based on ESAPI for Java 1.4 or earlier. Lots of things have changed in 2.0. If you are getting this or a similar error in ESAPI for Cold Fusion, then this bug report should be filed with Adobe, not here.
I am marking this as 'wontfix' and am closing it because I believe that is the misapplication of doing the canonicalization that is causing the perceived problem, and as previously mentioned, you should not be doing that.