`\u00{80..ff}` characters in string literals are translated incorrectly in some languages
generalmimon opened this issue · 1 comments
For example, if you parse a UTF-8 string with the U+00A3 POUND SIGN (`£`) character and test it for equality with the `"£"` string literal (or equivalently `"\u00a3"`), you'll get `false` in some target languages.
Below is a reproducible .ksy snippet that assumes a binary input `c2 a3` (this is the pound sign encoded in UTF-8 using Python: `"\u00a3".encode('utf-8').hex(' ') == 'c2 a3'`, `"\u00a3" == '£'`):
meta:
  id: str_literals_latin1
seq:
  - id: parsed
    size: 2
    type: str
    encoding: UTF-8
instances:
  parsed_eq_literal:
    value: parsed == "\u00a3"
According to my tests, `parsed_eq_literal` will be `false` in C++, Go, Lua, Nim, PHP and Ruby. This indicates that in these languages, the string literal `"\u00a3"` was translated incorrectly, as it apparently doesn't represent a UTF-8 string with the U+00A3 character (i.e. the pound sign):
$ grep -ri 'parsed_\?eq_\?literal.* =' -B1 | grep -F '\'
cpp_stl_11/str_literals_latin1.cpp: m_parsed_eq_literal = parsed() == (std::string("\243"));
cpp_stl_98/str_literals_latin1.cpp: m_parsed_eq_literal = parsed() == (std::string("\243"));
go/src/test_formats/str_literals_latin1.go: this.parsedEqLiteral = bool(this.Parsed == "\243")
graphviz/str_literals_latin1.dot: <TR><TD>parsed_eq_literal</TD><TD>parsed == "\243"</TD></TR>
lua/str_literals_latin1.lua: self._m_parsed_eq_literal = self.parsed == "\243"
nim/str_literals_latin1.nim: let parsedEqLiteralInstExpr = bool(this.parsed == "\243")
php/StrLiteralsLatin1.php: $this->_m_parsedEqLiteral = $this->parsed() == "\243";
ruby/str_literals_latin1.rb: @parsed_eq_literal = parsed == "\243"
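A likely explanation, stated here as an assumption rather than a confirmed diagnosis: in these languages string literals are byte sequences, so the octal escape `\243` yields the single byte 0xA3, whereas the parsed value holds the two-byte UTF-8 encoding `c2 a3`. The Python sketch below models both sides as `bytes` to illustrate the mismatch; it does not run any of the generated code, and the variable names are illustrative only.

```python
# Model of the comparison as it plays out in byte-string languages.
# Assumption: "\243" in those languages produces the single byte 0xA3.
parsed = bytes.fromhex("c2 a3")              # input bytes from the .ksy example
literal_as_emitted = b"\243"                 # what the generated literal amounts to
literal_expected = "\u00a3".encode("utf-8")  # UTF-8 encoding of U+00A3

print(parsed == literal_as_emitted)  # False: one byte (a3) vs. two bytes (c2 a3)
print(parsed == literal_expected)    # True
```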
In contrast, in C#, Java, JavaScript, Perl, Python and Rust, the `parsed_eq_literal` instance evaluates to `true`, so we can say that `"\u00a3"` was translated correctly for these target languages:
construct/str_literals_latin1.py: 'parsed_eq_literal' / Computed(lambda this: this.parsed == u"\243"),
csharp/StrLiteralsLatin1.cs: _parsedEqLiteral = (bool) (Parsed == "\u00a3");
java/src/io/kaitai/struct/testformats/StrLiteralsLatin1.java- boolean _tmp = (boolean) (parsed().equals("\243"));
javascript/StrLiteralsLatin1.js: this._m_parsedEqLiteral = this.parsed == "\xa3";
perl/StrLiteralsLatin1.pm: $self->{parsed_eq_literal} = $self->parsed() eq "\243";
python/str_literals_latin1.py: self._m_parsed_eq_literal = self.parsed == u"\243"
rust/str_literals_latin1.rs: *self.parsed_eq_literal.borrow_mut() = (*self.parsed() == "\u{a3}".to_string()) as bool;
To see which characters are affected, see the "C1 Controls and Latin-1 Supplement" table at https://en.wikibooks.org/wiki/Unicode/Character_reference/0000-0FFF
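As a rough way to enumerate the affected range without the table, the sketch below walks code points U+0080 to U+00FF and contrasts the single-byte octal escape (the form seen in the generated output above) with octal escapes of the UTF-8 bytes. The helper names `single_byte_escape` and `utf8_octal_escape` are hypothetical, and writing out UTF-8 byte escapes is only one conceivable representation for byte-oriented targets, not necessarily how the compiler does or should handle it.

```python
def single_byte_escape(cp: int) -> str:
    """Octal escape of the code point value itself, e.g. U+00A3 -> \\243."""
    return "\\%03o" % cp


def utf8_octal_escape(cp: int) -> str:
    """Octal escapes of each byte in the code point's UTF-8 encoding."""
    return "".join("\\%03o" % b for b in chr(cp).encode("utf-8"))


# Every code point in this range needs two bytes in UTF-8,
# so a single-byte escape cannot match a UTF-8 encoded string.
for cp in range(0x80, 0x100):
    print(f"U+{cp:04X}  {single_byte_escape(cp)}  ->  {utf8_octal_escape(cp)}")
```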