Encode.forJavaScript()
ndp-opendap opened this issue · 7 comments
Greetings,
When I do the following:
String val = "CF-1.5";
String jsVal = Encode.forJavaScript(val);
The ending value of jsVal
is "CF\-1.5". Based on the javadoc for the Encode class here:
https://owasp.github.io/owasp-java-encoder/encoder/apidocs/index.html?index-all.html
I believe that Encode.forJavaScript()
is incorrectly/needlessly escaping the "-" character.
The javadoc does not list "-" as a character to be escaped, and the RFC for JSON:
https://tools.ietf.org/html/rfc7159#section-7
Does not list the "-" character as one that is included in the list of "two-character sequence escape representations of some popular characters".
It seems that since any character can be escaped there is no issue in escaping the "-" character, but I do not think there is a two character shorthand "\-" supported and that the hex escaped version must be used.
Is this a known problem? Has no one else had their JSON parser throw a rod because of this unsupported escape sequence?
Thanks,
Nathan
Perhaps the application of this rule?
if (mode == Mode.BLOCK || mode == Mode.HTML) {
// in <script> blocks, we need to prevent the browser from seeing
// "</anything>" and "<!--". To do so we escape "/" as "\/" and
// escape "-" as "\-".
I think so.
Sorry for the spam above, but I agree with the original post by Nathan and with the analysis by @pthorson.
The encoding of raw -
as raw \-
results in an invalid JSON (but a valid ECMAScript) string literal.
string = quotation-mark *char quotation-mark
char = unescaped /
escape (
%x22 / ; " quotation mark U+0022
%x5C / ; \ reverse solidus U+005C
%x2F / ; / solidus U+002F
%x62 / ; b backspace U+0008
%x66 / ; f form feed U+000C
%x6E / ; n line feed U+000A
%x72 / ; r carriage return U+000D
%x74 / ; t tab U+0009
%x75 4HEXDIG ) ; uXXXX U+XXXX
escape = %x5C ; \
quotation-mark = %x22 ; "
unescaped = %x20-21 / %x23-2E / %x30-5B / %x5D-10FFFF
https://www.rfc-editor.org/errata_search.php?rfc=8259
CharacterEscapeSequence ::
SingleEscapeCharacter
NonEscapeCharacter
SingleEscapeCharacter :: one of
' " \ b f n r t v
NonEscapeCharacter ::
SourceCharacter but not one of EscapeCharacter or LineTerminator
https://www.ecma-international.org/ecma-262/6.0/#sec-literals-string-literals
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String#Escape_notation
https://www.json.org/
There is little need to have the "mode" parameter in the Javascript string encoder. It brings unnecessary usage complexity as encoding for the worst case would still be parseable in the easiest use case as well.
In addition, the existing code missed extra attack surfaces,
This suggests to encode with raw \uHHHH
the following single characters, regardless of "mode" (and add another method signature without the parameter, deprecating the current method). This will add to the existing encodings of the double quote raw "
as raw \"
et c. but remove the ECMA compatible non-JSON encoding raw \-
. The encoding of the forward slash raw /
as raw \/
appears compatible with both ECMA and JSON. To sum up, these characters need replacing on output.
- the double quote
raw "
withraw \"
against escaping a javascript string surrounded by double quotes, - the single quote
raw '
withraw \u0027
against escaping a javascript string surrounded by single quotes, - the forward slash
raw /
withraw \/
against constructing a closing script tag in a javascript string, - the opening angle bracket
raw <
against closing the script tag or opening a comment tag, - the closing angle bracket
raw >
against closing a possible CDATA wrapper around the javascript text embedded in XHTML (and HTML?), U+2028
andU+2029
allowed in JSON but not in Javascript against disrupting the javascript engine.- the ampersand
raw &
against the mandatory HTML entity decoding in XHTML documents embedding scripts without a CDATA wrapper. (HTML documents do not apply the entity decoding to embedded scripts but the suggested string literal encoding does not change the string's value).
I understand that these layers look secondary to JSON or Javascript string literal encoding per se, but leaving the task of protecting the string literal against these layers would require re-implementing the encoder for each use case. Attempting to apply additional encoding with simple .replace() chaining will suffer from destroying the context of the backslash-protected characters. Luckily, the suggested "preliminary optimization" by using the raw \uHHHH
encoding is both acceptable by the lower level interpreter (javascript engine) and defending against additional interpreters on top of it.
This seems like a good argument. Would you care to give us a PR? I'll also as Jeff to take a look at this.
PS: The earlier spam was accidental and I deleted it. I'm back to updating this project.
After speaking to Jeff on this, we have the following response and wish to close the issue and open a new feature request.
Summary: encodeForJavaScript is not meant to be safe for JSON, but we do plan to add a new JSON encode function, in the meantime consider https://github.com/yahoo/serialize-javascript type logic to safely embed JSON on a page.