canonicalize sees entity which isn't there

Question

canonicalize sees entity which isn't there

bardware opened this issue a year ago · 7 comments

Describe the bug
I call canonicalize on a string and pass the resulting string to encodeForJavaScript. Characters show up, that are not supposed to be there

Specify what ESAPI version(s) you are experiencing this bug in
I tried various versions including the latest 2.5.2.

To Reproduce
I have this Java program

import org.owasp.esapi.ESAPI;
import org.owasp.esapi.Encoder;

public class BardwareEsapi {

  public static void main(String[] args) {
    System.out.println( ESAPI.encoder().encodeForJavaScript( ESAPI.encoder().canonicalize( "fit&gesund" ) ) );
  }
}

I compile it like javac.exe -cp esapi-2.5.2.0.jar BardwareEsapi.java
and run it like java.exe -cp .;esapi-2.5.2.0.jar;log4j-1.2.17.jar BardwareEsapi

Expected behavior
On the commandline I want to read the input string fit&gesund. However, I see fit\u2265sund. \u2265 stands for Greater-Than Or Equal.
I guess, the substring &ge is interpreted as the HTML entity &ge which is not there as there is no semicolon.

Platform environment (please complete the following information):

Windows; 64 bit, German
JAVA 17

Additional context
I originally noticed this issue on Adobe Coldfusion where I call encodeForJavaScript( "fit&gesund", true ) and get the corrupted string. I expected this to be an issue, because ColdFusion comes with an elder version of your library but it seems this issue exists in the most recent version, too.

Answer 1 · 2023-08-05T01:09:07.000Z

@bardware - I don't think this is a bug. I think you are using it incorrectly. You really should NOT be calling Encoder.canonicalize() here. It's important to call canonicalize before you doing validation, but not when you are doing output encoding.

If I pass the input of "fit&gesund" to Encoder.canonicalize(), it will return the string "fit≥sund", which is where your "≥" arises from. Calling Encoder.encodeForJavaScript() on that then returns "fit\u2265sund". However, if you call Encoder.encodeForJavaScript() directly on the string "fit&gesund", it will return the string "fit\x26gesund", which makes sense as '&' is 0x26 in ASCII and UTF-8.

If you agree, then please close this ticket. If you disagree, then please explain your rationale for first calling Encoder.canonicalize before doing output encoding.

Answer 2 · 2023-08-05T02:33:00.000Z

Kevin did most of the heavy lifting.

Probably part of the problem is that you GOT the value using Validator.getValidInput() which IIRC runs a canonicalization on it.

I've long felt that the semantics of that part of the API were terrible. Because it encourages storage of the canonicalized input.

Proper practice would be to canonicalize, validate, and then discard the validated version and store the original input (because canonicalization is destructive.)

I think we should update the documentation of the getValid methods to make the intention more clear.

After 14yrs this feels like a "code smell" like "getValidSafeHTML()"

At any rate, I don't think this is a bug because browsers will render HTML entities without the semicolon (last tested about three months back) and the safe default because of that would be to reduce them whenever discovered. Browser Vendors are the ones forcing that.

If you think THAT is an ESAPI issue, ESAPI is BSD licensed and you can write your own HTMLEntityCodec that strictly only detects with the semicolons. However if that causes security issues for you, that's on you as you're deviating from a safe default.

Answer 3 · 2023-08-05T03:41:05.000Z

What Matt didn't say was the the ESAPI output encoders were *intentionally* written to provide a "safe harbor" to users who used a popular set of badly broken and vulnerable browsers from a certain company whose first goal was definitely not security and who played fast and loose with the W3C specs. Things aren't quite as bad as they once were in the regard, but ESAPI has still retained that behavior over the years. If you want something that has no safe harbor to protect your users from misbehaving browsers, either write your own HtmlEntityCodec like Matt suggested or switch over to the OWASP Java Encoder Project.

…

Message ID: ***@***.***>

Answer 4 · 2023-08-06T01:45:32.000Z

Exactly. The spec doesn’t matter. The implementation(s) matter. When we redesign the Internet, let’s not have 20 different encoding / escaping schemes in nested contexts.Thanks,—Jeff410-707-1487On Aug 4, 2023, at 8:41 PM, Kevin W. Wall ***@***.***> wrote: What Matt didn't say was the the ESAPI output encoders were *intentionally* written to provide a "safe harbor" to users who used a popular set of badly broken and vulnerable browsers from a certain company whose first goal was definitely not security and who played fast and loose with the W3C specs. Things aren't quite as bad as they once were in the regard, but ESAPI has still retained that behavior over the years. If you want something that has no safe harbor to protect your users from misbehaving browsers, either write your own HtmlEntityCodec like Matt suggested or switch over to the OWASP Java Encoder Project.

Message ID: ***@***.***>

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

Answer 5 · 2023-08-07T21:00:09.000Z

Thank you for your replies.
I got the original behaviour when I used ColdFusion
The EncodeForHTML function takes two arguments. The string and a value that says if canonicalization happens before encoding.
I guess this is supposed to ensure entities are not encoded multiple times. That's how I expect it to work.

Answer 6 · 2023-08-07T22:23:56.000Z

I guess it would be okay to do a canonicalization check before encoding I guess and if you see that the canonicalized string and the original input different, just reject it outright (as an attempted attack, say), but at least for the ESAPI for Java output encoders, feeding the canonicalized output into the encoders as input is going to cause problems like I originally noted. And output encoding is safe (as long as you select the encoder for the appropriate context), even if the string is not in canonical form. Maybe ColdFusion did something differently.

As far as I know, that's how the ESAPI for Java output encoders have always worked, at least going back as far as version 2.0, which is when I first got involved. You can look at 'git blame' on DefaultEncoder.java if you want, but the only recent change was an addition to support encodeForJSON and a bit before that, @xeno6696 added changes that (I think) had to do with encodeForURL (or perhaps that was only addressing the Validator; it's been a while). I think it was related to GitHub Issue #300 though. Matt can expound or correct me if I'm wrong or forgetting something. Other than that, I think there's just been a few bug fixes, but nothing else major.

Answer 7 · 2023-11-25T14:50:01.000Z

@bardware wrote:

I got the original behaviour when I used ColdFusion
The EncodeForHTML function takes two arguments. The string and a value that says if canonicalization happens before encoding.
I guess this is supposed to ensure entities are not encoded multiple times. That's how I expect it to work.

@planetlevel can correct me if I am wrong, but I believe that technically ESAPI for Cold Fusion has never been an OWASP project and it is in fact based on ESAPI for Java 1.4 or earlier. Lots of things have changed in 2.0. If you are getting this or a similar error in ESAPI for Cold Fusion, then this bug report should be filed with Adobe, not here.

I am marking this as 'wontfix' and am closing it because I believe that is the misapplication of doing the canonicalization that is causing the perceived problem, and as previously mentioned, you should not be doing that.