\u00{80..ff} characters in string literals are translated incorrectly in some languages #1133

Open
generalmimon opened this issue Sep 26, 2024 · 1 comment

generalmimon commented Sep 26, 2024

For example, if you parse a UTF-8 string with the U+00A3 POUND SIGN (£) character and test it for equality with the "£" string literal (or equivalently "\u00a3"), you'll get false in some target languages.

Below is a .ksy snippet that reproduces the problem. It assumes the binary input c2 a3, which is the pound sign encoded in UTF-8 (in Python: "\u00a3".encode('utf-8').hex(' ') == 'c2 a3' and "\u00a3" == '£'):

meta:
  id: str_literals_latin1
seq:
  - id: parsed
    size: 2
    type: str
    encoding: UTF-8
instances:
  parsed_eq_literal:
    value: parsed == "\u00a3"

According to my tests, parsed_eq_literal will be false in C++, Go, Lua, Nim, PHP and Ruby. This indicates that in these languages, the string literal "\u00a3" was translated incorrectly: the generated "\243" escape denotes a single byte, so it apparently doesn't represent a UTF-8 string with the U+00A3 character (i.e. the pound sign), which takes the two bytes c2 a3:

$ grep -ri 'parsed_\?eq_\?literal.* =' -B1 | grep -F '\'
cpp_stl_11/str_literals_latin1.cpp:    m_parsed_eq_literal = parsed() == (std::string("\243"));
cpp_stl_98/str_literals_latin1.cpp:    m_parsed_eq_literal = parsed() == (std::string("\243"));
go/src/test_formats/str_literals_latin1.go:     this.parsedEqLiteral = bool(this.Parsed == "\243")
graphviz/str_literals_latin1.dot:                       <TR><TD>parsed_eq_literal</TD><TD>parsed == &quot;\243&quot;</TD></TR>
lua/str_literals_latin1.lua:  self._m_parsed_eq_literal = self.parsed == "\243"
nim/str_literals_latin1.nim:  let parsedEqLiteralInstExpr = bool(this.parsed == "\243")
php/StrLiteralsLatin1.php:            $this->_m_parsedEqLiteral = $this->parsed() == "\243";
ruby/str_literals_latin1.rb:    @parsed_eq_literal = parsed == "\243"
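
The mismatch can be modeled in Python with bytes objects standing in for byte-oriented target strings. This is only a sketch under the assumption that the target reads "\243" as an octal escape for one byte (as C++ does); a language that reads \ddd as decimal would produce a different single byte, which is not modeled here. The helper utf8_octal_literal is hypothetical and not part of the Kaitai Struct compiler; it only illustrates one way a byte-exact literal could be written.

```python
# Sketch only: Python bytes stand in for a byte-oriented target string.
parsed = bytes.fromhex("c2 a3")  # the input: U+00A3 encoded as UTF-8
literal = b"\243"                # octal escape, i.e. the single byte 0xA3

print(parsed == literal)         # False: two bytes vs. one byte


def utf8_octal_literal(text: str) -> str:
    """Hypothetical helper: escape every UTF-8 byte of `text` in octal."""
    escaped = "".join(f"\\{byte:03o}" for byte in text.encode("utf-8"))
    return f'"{escaped}"'


print(utf8_octal_literal("\u00a3"))  # "\302\243"
```

Three-digit octal escapes are used in the sketch because they have a fixed maximum length, so a following digit in the string cannot be absorbed into the escape. For simplicity the helper escapes ASCII characters as well.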

In contrast, in C#, Java, JavaScript, Perl, Python and Rust, the parsed_eq_literal instance evaluates to true, so we can say that "\u00a3" was translated correctly for these target languages:

construct/str_literals_latin1.py:       'parsed_eq_literal' / Computed(lambda this: this.parsed == u"\243"),
csharp/StrLiteralsLatin1.cs:                _parsedEqLiteral = (bool) (Parsed == "\u00a3");
java/src/io/kaitai/struct/testformats/StrLiteralsLatin1.java-        boolean _tmp = (boolean) (parsed().equals("\243"));
javascript/StrLiteralsLatin1.js:      this._m_parsedEqLiteral = this.parsed == "\xa3";
perl/StrLiteralsLatin1.pm:    $self->{parsed_eq_literal} = $self->parsed() eq "\243";
python/str_literals_latin1.py:        self._m_parsed_eq_literal = self.parsed == u"\243"
rust/str_literals_latin1.rs:        *self.parsed_eq_literal.borrow_mut() = (*self.parsed() == "\u{a3}".to_string()) as bool;
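
For contrast, here is a short Python sketch of why the same "\243" escape works in the passing targets. It assumes, as is the case for Python's str, that a numeric escape in a string literal names a character (code point) rather than a raw byte, and that the parsed bytes are decoded before the comparison.

```python
# Sketch: in a Python str, "\243" is the code point U+00A3, not a raw byte.
parsed = bytes.fromhex("c2 a3").decode("utf-8")  # decode the input as UTF-8
literal = "\243"                                 # octal escape for U+00A3

print(hex(ord(literal)))   # 0xa3
print(parsed == literal)   # True
```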

generalmimon commented

To see which characters are affected, see the C1 Controls and Latin-1 Supplement table at https://en.wikibooks.org/wiki/Unicode/Character_reference/0000-0FFF
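
The affected range can also be checked with a few lines of Python. This is a sketch under the assumption that the problem is exactly the one shown above, namely that the code point is emitted as a single byte of the same value: it lists every code point below U+0100 whose UTF-8 encoding differs from that single byte.

```python
# Sketch: find code points whose UTF-8 form differs from one raw byte
# of the same numeric value.
affected = [
    cp for cp in range(0x100)
    if chr(cp).encode("utf-8") != bytes([cp])
]

print(f"U+{affected[0]:04X}..U+{affected[-1]:04X}, {len(affected)} code points")
# U+0080..U+00FF, 128 code points
```

Code points below U+0080 are ASCII and encode to the same single byte in UTF-8, which is why they are unaffected.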

generalmimon added a commit to kaitai-io/kaitai_struct_tests that referenced this issue Sep 29, 2024
This reverts commit 1d8116f.

I'm reverting this for "compatibility reasons": it turned out that the
U+00A0 character was causing chaos in a number of target languages and
our CI infrastructure, which wasn't really the point. I've described the
cause in kaitai-io/kaitai_struct#1133. I will
probably cover this case in a separate test sometime in the future.
generalmimon added a commit to kaitai-io/kaitai_struct_tests that referenced this issue Sep 29, 2024