xoreos/xoreos-tools

NWN2's TLK mixes encodings, breaks tlk2xml

DrMcCoy opened this issue · 16 comments

The TLK file in Neverwinter Nights 2 mixes encodings. This breaks our tlk2xml utility.
Spawned off of this thread on the Neverwinter Vault: https://neverwintervault.org/comment/36764

When interpreting the TLK as CP-1252 (either using the command line switch --cp1252 or --nwn2), iconv will complain about some of these (and output "[!?!]") and produce garbage for others. When interpreting the TLK as UTF-8 (using the command line switch --utf8), our UTF-8 string class will throw for some of them.

What follows is a rundown of all problematic strings (with IDs and the raw data of the offending characters in hex notation) found in the unmodified English dialog.tlk from the GOG version of Neverwinter Nights 2. The natural encoding should be Windows Codepage 1252, but some strings are UTF-8 instead, a few Polish strings are probably CP-1250, and some very broken strings are even double-UTF-8.

  • û in Faerûn, Faerûnian, Selûne, Selûnian is encoded as UTF-8 [C3 BB]:

    • 252
    • 12891
    • 13008
    • 13010
    • 13022
    • 13034
    • 13038
    • 13040
    • 13050
    • 13056
    • 13410
    • 13424
    • 13446
    • 13906
    • 13921
    • 13961
    • 13978
    • 13993
    • 14017
    • 14030
    • 14076
    • 14088
    • 40083
    • 41024
    • 41128
    • 47128
    • 47173
    • 48907
    • 48937
    • 60920
    • 60963
    • 68839
    • 68850
    • 75624
    • 75631
    • 79701
    • 79747
    • 83492
    • 86778
    • 91185
    • 91248
    • 91258
    • 94988
    • 95023
    • 95081
    • 107813
    • 112085
    • 127464
    • 128485
    • 129198
    • 131466
    • 131937
    • 131938
    • 131939
    • 132428
    • 132681
    • 132683
    • 132686
    • 132827
    • 132830
    • 133552
    • 133553
    • 136564
    • 138234
    • 138266
    • 138941
    • 139484
    • 142353
    • 142830
    • 142831
    • 142847
    • 142848
    • 142855
    • 142876
    • 143298
    • 145320
    • 146031
    • 146607
    • 150041
    • 151576
    • 151969
    • 151970
    • 151986
    • 151987
    • 151994
    • 152163
    • 152184
    • 152185
    • 152193
    • 152194
    • 158868
    • 158913
    • 159535
    • 159541
    • 159549
    • 159575
    • 159668
    • 160377
    • 161440
    • 161613
    • 161913
    • 162188
    • 162225
    • 162394
    • 164349
    • 165035
    • 168344
    • 173752
    • 175833
    • 176226
    • 176228
    • 176452
    • 176463
    • 176797
    • 178078
    • 178232
    • 179492
    • 179646
    • 182325
    • 183608
    • 183610
    • 183613
    • 183616
    • 183619
    • 183621
    • 184288
    • 185484
    • 190706
    • 192763
    • 194275
    • 206565
    • 228547
    • 230125
    • 231209
    • 234736
    • 234737
    • 234746
  • é in fiancé, décor and protégé is encoded as UTF-8 [C3 A9]:

    • 14074
    • 14101
    • 206490
  • ï in naïve is encoded as UTF-8 [C3 AF]:

    • 144840
  • Double-UTF-8 (UTF-8 data interpreted as Windows CP-1252 and then encoded as UTF-8 again):

    • 155891
    • 159776
    • 159836
    • 174702
    • 176247
    • 176670
    • 176671
    • 177416
    • 181507
    • 176832 ([C3 A2 E2 82 AC E2 80 9D], that's em-dash [0xE2 0x80 0x94] in double-UTF-8)
  • ½ is encoded as CP-1252 [BD]:

    • 241
    • 6075
    • 185809
    • 185819
  • © is encoded as CP-1252 [A9]:

    • 3038
  • … (ellipsis) is encoded as CP-1252 [85]:

    • 75941
    • 75947
    • 75949
    • 75950
    • 76064
    • 76066
    • 76068
    • 76070
    • 76071
    • 76072
    • 76080
    • 76082
    • 76114
    • 76124
    • 76135
    • 76136
    • 76144
    • 76153
    • 76182
    • 76356
    • 76367
  • CP-1252 smart single quotes ‘ [91] and ’ [92]:

    • 53031
  • CP-1252 smart apostrophe ’ [92]:

    • 13770
    • 25203
    • 90857
    • 90861
    • 91117
    • 91125
    • 91129
    • 91145
    • 91179
    • 91262
  • UTF-8 smart double quotes “ [E2 80 9C] and ” [E2 80 9D]:

    • 137098
    • 155799
    • 161607
    • 162084 (also with UTF-8 Faerûn)
    • 177903
    • 178401
    • 180942 (also with UTF-8 … (ellipsis) [E2 80 A6])
    • 205916
    • 218438 (also with UTF-8 ’ (apostrophe) [E2 80 99])
    • 232910
    • 232945
    • 232952
    • 232958
    • 232962
  • Polish strings with the letter ł:

    • 3114
    • 3116
  • Polish strings with unknown letter (ż?):

    • 3143
    • 3145
  • Unknown strings with unknown encoding, two unknown letters (Polish? One of the letters might be ą?):

    • 3127
    • 3128
    • 3129
    • 3134
    • 3137
    • 3138
  • Unknown strings with unknown encoding, single letters (might be CP-1252?):

    • 3154
    • 3155
    • 3157

Would iconv be useful for this? At least for building a conversion dictionary.

We're already using iconv in https://github.com/xoreos/xoreos/blob/master/src/common/encoding.h / https://github.com/xoreos/xoreos/blob/master/src/common/encoding.cpp .

Converting isn't the problem, the issue is identification, which is not really 100% possible. Then there's the double-UTF-8. And if there are strings with multiple encodings, that's even more trouble.

Also, how should we handle it in the xoreos-tools? Silently convert everything to UTF-8? Will that work for the original game?

Yes, it doesn't sound like a completely automatable solution is available. iconv will at least output a 0xFF character if it doesn't match the encoding set, so that will catch some. Maybe the rest can be eye-balled and put in a conversion array/file, one per language, at least until a better solution is found? shrug

There's a perl module that guesses at the encoding of a text string:
https://metacpan.org/source/DANKOGAI/Encode-2.98/lib/Encode/Guess.pm
https://perldoc.perl.org/Encode/Guess.html
Perhaps that could be useful for deriving some coding logic?

This looks interesting:
https://github.com/neitanod/forceutf8/blob/master/src/ForceUTF8/Encoding.php

Perhaps it can be used as an identification tool?
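In the same spirit, here is a minimal fallback heuristic sketched in Python (my own sketch, not forceutf8 itself): attempt a strict UTF-8 decode first and fall back to CP-1252 only when it fails. Any multi-byte sequence that happens to be well-formed UTF-8 is far more likely to be UTF-8 than deliberate CP-1252 text:

```python
def guess_decode(raw: bytes) -> tuple[str, str]:
    """Decode with a strict UTF-8 attempt, falling back to CP-1252.

    Pure-ASCII data decodes identically either way. Caveat: this
    cannot catch double-UTF-8, which is itself well-formed UTF-8."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # CP-1252 leaves only 81, 8D, 8F, 90 and 9D undefined, so
        # this nearly always succeeds on real-world data.
        return raw.decode("cp1252", errors="replace"), "cp1252"

assert guess_decode(b"Faer\xc3\xbbn") == ("Faerûn", "utf-8")
assert guess_decode(b"Well\x85") == ("Well…", "cp1252")
```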

A thought occurred: it may be possible that neither NWN nor NWN2 is using those 3K fields. I searched the 2da files in NWN2 for a sample but found no matches. Ideally one could extract all the StrRef entries in both games and do a compare.

Yes, I'm pretty sure the game is not using some of the really broken ones.

This does help us in xoreos proper, but how are we going to handle it in our tlk2xml tool?

"Converting isn't the problem, the issue is identification, which is not really 100% possible. Then there's the double-UTF-8. And if there are strings with multiple encodings, that's even more trouble."

If you believe that there are other instances of encoding issues, then perhaps a probability-based approach will work? Write a utility that can build a frequency count table of byte pairs. Encoding issues will presumably be outliers, so pass an argument specifying a cut-off count, with table entries at or below this argument being output as a conversion file. There's a lot of sample data to work with, so most of the remaining bad encoding patterns that haven't already been caught should be the (relatively) rare exceptions.

This tool makes a first pass through the file, building a count array of all double-byte patterns while ignoring whitespace and punctuation. The second pass can then build a draft conversion file, listing a hex data array of the 2+ byte combinations followed by the containing word as a comment. Thus:

0xC3, 0xA9, // fiancé (1 instance)

If the file isn't too noisy with false positives, we can manually peruse the exception list and throw out the entries that look okay. Hopefully the hand-massaged data file can then be used as the front end of a conversion table.
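If I read the proposal right, the two passes could look something like this Python sketch (function names and the exact skip rules are my own; the real tool would read the TLK directly, and the containing-word lookup is omitted here):

```python
from collections import Counter


def byte_pair_counts(strings):
    """First pass: count adjacent byte pairs where at least one byte
    falls outside printable ASCII (skipping plain letters, digits,
    punctuation and spaces)."""
    counts = Counter()
    for raw in strings:
        for a, b in zip(raw, raw[1:]):
            if 0x20 <= a < 0x7F and 0x20 <= b < 0x7F:
                continue  # both bytes are ordinary ASCII
            counts[(a, b)] += 1
    return counts


def draft_conversion(strings, cutoff=2):
    """Second pass: pairs at or below the cut-off become draft entries
    in the format suggested above."""
    counts = byte_pair_counts(strings)
    for (a, b), n in sorted(counts.items()):
        if n <= cutoff:
            yield "0x%02X, 0x%02X, // (%d instance%s)" % (
                a, b, n, "" if n == 1 else "s")
```

For example, feeding it `[b"Faer\xc3\xbbn", b"fianc\xc3\xa9"]` yields, among others, the entry `0xC3, 0xA9, // (1 instance)`.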

So... is this something like you have in mind? Or am I just plain misunderstanding?

"Also, how should we handle it in the xoreos-tools? Silently convert everything to UTF-8? Will that work for the original game?"

Are you contemplating building a modified TLK of all UTF-8 characters that can be used in the original game? We could try the output in the original game and see if the non-verbal text is intelligible. It should show up in the character build stage, such as when you read the class descriptions.

"Are you contemplating building a modified TLK of all UTF-8 characters that can be used in the original game?"

No.

What I'm saying is this: right now, we have two tools: tlk2xml, which converts a whole TLK into a user-editable XML, and xml2tlk, which takes such an XML and converts it back into a TLK. The catch is that the first tool has, of course, no knowledge of which strings are and aren't used by the game. It simply converts all of the strings. And it breaks for these broken strings.

How are we going to handle the NWN2 TLK, with those broken strings, in these tools, so that a modder can use them to take the NWN2 TLK, modify some strings, and recreate a working TLK out of it again?

Because right now, that use case is broken. tlk2xml will take the NWN2 TLK and produce an XML containing garbage in some strings. After going through xml2tlk, you'll have a TLK file containing broken strings, and that's bad and dangerous.

Okay. Well, I find all the file conversion and manipulation tooling you've developed here really cool, but unless some character combinations are going to break the engine, I'm not sure it's worth breaking a sweat over this little detail. All the gamer is going to (potentially) see are some bad strings in the game. If the mod builder cares, they'll edit their TLK. Otherwise the gamer will just keep on playing. :-)

You've listed a finite set of character issues to address. Why not just deal with those, both for the conversion and the reverse, then worry about future exceptions down the road? That'll limit the scope and make it doable in the near term.

If you are worried about game crashes from text string combinations, then it seems like some testing is needed. shrug

During testing of a new Journal class, I ran into what appears to be an error with strref #180942. This value is retrieved for the "construct" quest in the OC module.JRL file. The Journal routine used the getString call from the GFF3Struct class to get the "Text" field for the first entry for the quest, which is the above strref.

On running the game it generates the error: "WARNING: iconv() failed: Illegal byte sequence!" and returns the string "[!?!]". I checked the dialog.TLK row and it looked like ordinary text, at least in TLK EDIT.

Not much I can do about it at the moment.

Ed.: Now I see you have it listed above.

"How are we going to handle the NWN2 TLK, with those broken strings, in these tools, so that a modder can use them to take the NWN2 TLK, modify some strings, and recreate a working TLK out of it again?"

tlk2xml can output invalid strings to XML in hex/base64 and mark them with some attribute, for example broken="true". xml2tlk would then interpret such strings as raw byte arrays (in hex/base64 due to XML limitations) and write them back as is.

"tlk2xml can output invalid strings to XML in hex/base64 and mark them with some attribute, for example broken="true"."

Yeah, that seems to be the solution that least breaks things.

For Phaethon, if that ever gets a TLK editor, we can probably add a drop-down box and let the user override a misidentification.
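A sketch of that round-trip in Python (the element and attribute names here are illustrative assumptions; the actual tlk2xml XML schema may differ):

```python
import base64
import xml.etree.ElementTree as ET


def emit_string(parent, str_id, raw, encoding):
    """tlk2xml side: a string that decodes cleanly is written as text;
    one that does not is written verbatim as base64, marked broken."""
    node = ET.SubElement(parent, "string", id=str(str_id))
    try:
        node.text = raw.decode(encoding)
    except UnicodeDecodeError:
        node.set("broken", "true")
        node.text = base64.b64encode(raw).decode("ascii")
    return node


def read_string(node, encoding):
    """xml2tlk side: broken="true" means 'write these bytes untouched'."""
    if node.get("broken") == "true":
        return base64.b64decode(node.text)
    return node.text.encode(encoding)


root = ET.Element("tlk")
bad = emit_string(root, 75941, b"Well\x85", "utf-8")
assert bad.get("broken") == "true"
assert read_string(bad, "utf-8") == b"Well\x85"  # byte-exact round-trip
```

Note one limitation this sketch shares with the proposal: a double-UTF-8 string decodes cleanly, so it won't be flagged as broken.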

"During testing of a new Journal class, I ran into what appears to be an error with strref #180942. This value is retrieved for the "construct" quest in the OC module.JRL file. The Journal routine used the getString call from the GFF3Struct class to get the "Text" field for the first entry for the quest, which is the above strref."

Hmm, so those strings are actually used in the game. I had hoped they weren't. :P

How are we going to handle that in xoreos, then, though?

How does the original game handle that in the first place? The string itself is probably just read and treated as a raw byte array, and only when the text is displayed does it select the correct character. But how does it know that the 0xE2 here is the start of a UTF-8 ellipsis and not a â?
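The crux is that some byte sequences are well-formed in both encodings, so byte inspection alone cannot decide. The UTF-8 ellipsis bytes (as in strref 180942, per the list above) also form perfectly legal CP-1252 text:

```python
raw = b"\xe2\x80\xa6"  # UTF-8 ellipsis [E2 80 A6]

# Both decodes succeed, so the bytes themselves cannot tell you
# which reading was intended:
assert raw.decode("utf-8") == "…"      # the intended ellipsis
assert raw.decode("cp1252") == "â€¦"   # equally well-formed CP-1252
```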

Can we catch such iconv errors in the GFF3Struct::getString call so we can at least get a report on the ResRef value where it failed?

...There was a reason I made a failed encoding conversion not throw...but I can't remember anymore what that reason was :/

A kludgy work-around, then, is to check the converted string within the getString call and print an informative warning message if it matches the error pattern?
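That check could be as simple as this sketch (assuming "[!?!]" is the marker the converter substitutes on iconv failure, per the warning reported above; the function name is hypothetical):

```python
ICONV_FAILURE_MARKER = "[!?!]"  # substituted on iconv failure, per the report


def check_converted(text: str, strref: int) -> bool:
    """Return True (and warn) if the converted string carries the
    failure marker, so the offending StrRef gets reported."""
    if ICONV_FAILURE_MARKER in text:
        print("WARNING: encoding conversion failed for strref %d" % strref)
        return True
    return False


check_converted("[!?!]", 180942)  # prints the warning
```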