Hard-coded endianness in TagDescriptor.GetEncodedTextDescription
reinfallt opened this issue · 8 comments
GetEncodedTextDescription uses hard-coded big-endian for UNICODE (UTF-16) and little-endian for UTF-32.
As far as I can tell, the endianness of tags like the UserComment tag depends on the endianness of the file as a whole. I added a UserComment tag using digiKam to a little-endian JPEG file and the tag was encoded as little-endian UTF-16.
The raw data of the tag:
55 4E 49 43 4F 44 45 00 48 00 E5 00 72 00 69 00 67 00 20 00 73 00 61 00 6B 00
That makes sense, since the spec doesn't mention any endianness specifically for this tag. I don't know about UTF-32, since it is not officially in the spec.
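For illustration, here is a small standalone sketch (plain .NET, no metadata-extractor calls) of the effect: the payload after the 8-byte "UNICODE\0" ID code decodes cleanly as little-endian UTF-16, while decoding it as big-endian (the current hard-coded assumption) produces CJK-looking garbage.

```csharp
using System;
using System.Text;

class UserCommentEndiannessDemo
{
    static void Main()
    {
        // Raw UserComment bytes from above: 8-byte ID code "UNICODE\0", then the UTF-16 payload.
        byte[] raw =
        {
            0x55, 0x4E, 0x49, 0x43, 0x4F, 0x44, 0x45, 0x00, // "UNICODE\0"
            0x48, 0x00, 0xE5, 0x00, 0x72, 0x00, 0x69, 0x00,
            0x67, 0x00, 0x20, 0x00, 0x73, 0x00, 0x61, 0x00,
            0x6B, 0x00
        };

        // Little-endian UTF-16 (matching the file's byte order): "Hårig sak"
        Console.WriteLine(Encoding.Unicode.GetString(raw, 8, raw.Length - 8));

        // Big-endian UTF-16 (what GetEncodedTextDescription currently assumes): garbage
        Console.WriteLine(Encoding.BigEndianUnicode.GetString(raw, 8, raw.Length - 8));
    }
}
```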
@reinfallt can you share an image that fails to decode due to this?
In cases where the spec is under-specified (or when common deviations from the spec occur in practice), we have two options:
- Do what appears to work in the majority of cases
- Use a heuristic to detect which case is in use
However, because this is a general issue, we have the StringValue type which, internally, retains the bytes that back the string. This means that the consumer needn't use the encoding the library chooses, and may perform an alternative decoding in their own code.
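As a sketch of that consumer-side workaround (hedged: the exact shape of the raw value differs between tags and library versions, and the StringValue member names used here are an assumption), a caller who knows the file's byte order could re-decode the UserComment themselves:

```csharp
using System;
using System.Linq;
using System.Text;
using MetadataExtractor;
using MetadataExtractor.Formats.Exif;

static class UserCommentWorkaround
{
    // Sketch only: assumes the raw UserComment value is exposed either as a byte[]
    // or as a StringValue whose backing bytes are reachable (property names may vary).
    public static string? DecodeUserComment(string path, bool littleEndian)
    {
        var subIfd = ImageMetadataReader.ReadMetadata(path)
            .OfType<ExifSubIfdDirectory>()
            .FirstOrDefault();

        var raw = subIfd?.GetObject(ExifDirectoryBase.TagUserComment);

        byte[]? bytes = raw switch
        {
            byte[] b => b,
            StringValue sv => sv.Bytes, // assumption: StringValue exposes its backing bytes
            _ => null
        };

        if (bytes == null || bytes.Length <= 8)
            return null;

        // Skip the 8-byte character-code header ("UNICODE\0") and decode the rest
        // with the endianness the caller knows is correct for this file.
        var encoding = littleEndian ? Encoding.Unicode : Encoding.BigEndianUnicode;
        return encoding.GetString(bytes, 8, bytes.Length - 8).TrimEnd('\0');
    }
}
```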
If you find a reasonable heuristic for such cases, it would be great if you could share it.
Here is a picture with a little-endian UTF-16 UserComment tag:
https://www.dropbox.com/s/6oupzwyyfab0tn3/20190511_083402.jpg?dl=0
Exif SubIFD - User Comment = ??????
Should be:
Exif SubIFD - User Comment = Rådjur
Thanks for the sample image.
The Java library decodes this correctly by default, while the .NET one does not. We should find the implementation difference between the two.
Currently Java interprets UNICODE as little-endian, and .NET interprets UNICODE as big-endian.
Changing .NET to match Java:
- fixes decoding of Issue 313 (dotnet).jpg
- regresses decoding of Issue 173 (dotnet).jpg
Java's text decoding appears to gracefully handle inverted Unicode, as Java handles both of these files in the same way. We may be able to replicate that here.
Those files specify II and MM byte order, respectively. As you suggest, we could take that into account, though experience shows that even that would fail sometimes. If a reliable heuristic exists, it may be a more robust solution.
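For concreteness, the mapping being discussed is just this (a sketch; isMotorolaByteOrder stands in for whatever flag would be derived from the II/MM header, and is an assumption about what the handler would have available):

```csharp
using System.Text;

static class Utf16EncodingChoice
{
    // "MM" (Motorola) declares big-endian, "II" (Intel) declares little-endian.
    public static Encoding FromByteOrder(bool isMotorolaByteOrder)
        => isMotorolaByteOrder ? Encoding.BigEndianUnicode : Encoding.Unicode;
}
```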
A product I am using hits this same bug as well, so I want to resurface it.
It seems to me there may not be any ambiguity in the specification. I am not a Java developer (long-time .NET, and C++ before that), but in my experience UNICODE in a context like this tends to just mean UTF-16, completely without regard for endianness. To me, that's a signal that "the endianness is specified somewhere else". So again, for me, the correct solution seems like the one suggested before -- the endianness declared at the higher level of the file should apply to this directory as well.
Here are a couple of test images I've been working with. You'll notice both have identical metadata, one big-endian and one little-endian. (The metadata for both was embedded using exiftool.) They have Unicode text in both the Copyright tag (which works regardless of endianness) and the UserComment string (which works for big-endian but breaks for little-endian).
Here is a unit test I wrote that tests both images, works for big-endian, and fails for little-endian:
```csharp
using System.Linq;
using MetadataExtractor;
using MetadataExtractor.Formats.Exif;
using MetadataExtractor.Formats.Jpeg;
using Xunit;

public sealed class UnicodeUserCommentTests
{
    [Theory]
    [InlineData("Data/unicode_be.jpg")]
    [InlineData("Data/unicode_le.jpg")]
    public void TestUnicode(string filename)
    {
        var metadata = JpegMetadataReader.ReadMetadata(filename);
        var directory1 = metadata.OfType<ExifIfd0Directory>().FirstOrDefault();
        Assert.NotNull(directory1);
        var directory2 = metadata.OfType<ExifSubIfdDirectory>().FirstOrDefault();
        Assert.NotNull(directory2);
        // Copyright decodes correctly for both byte orders...
        Assert.Equal("This is a test, 学生", directory1.GetString(ExifDirectoryBase.TagCopyright));
        // ...but UserComment only decodes correctly for the big-endian file.
        Assert.Equal("Diffusers ☛ YamerMIX NihilMania", directory2.GetDescription(ExifDirectoryBase.TagUserComment));
    }
}
```
P.S. Also, I have never heard of a "reliable heuristic" for determining the "best fit" for endianness. Indeed, in many cases the same string of bytes can be interpreted as either UTF-16BE or UTF-16LE. I suppose if you see a long stretch with alternating zero and non-zero bytes, one could hazard a guess, but still that would feel like an unsafe idea to me. There have been so many bugs over the years, in all layers of OS and software, that were caused by bad assumptions when it comes to Unicode encoding.
P.P.S. In the meantime I guess I'll look at seeing if I can patch the other product I mentioned, to try to use the raw StringValue instead and do the decoding itself, as you suggested above. (But that seems like a hack that doesn't benefit any of your other users.) Or redo my entire workflow to try to force everything to be tagged big-endian before the files ever reach this point... but that's the worst of all solutions. I am definitely curious about the situations where "experience shows" that using the endianness declared by II or MM would fail.
I guess my problem is that, once the metadata has been read, there seems to be no way for the software (which is using metadata-extractor) to detect what has happened. For example, I use metadata-extractor to read the metadata for a file, and it simply tells me the UserComment is "䐀椀昀昀甀猀攀爀猀 ᬦ 夀愀洀攀爀䴀䤀堀 一椀栀椀氀䴀愀渀椀愀". Okay, great, so the user comment is a bunch of gibberish that I assume to be Chinese or Japanese. (I suppose sharp-eyed viewers may note that there is a single non-CJK character in there, and maybe that could be a hint that something might be wrong?) But at this point I am querying the metadata, and no error or bug is detected, so I have no choice but to assume metadata-extractor was telling me the truth and this is actually the UserComment. I cannot tell, at this point, that the UserComment is actually "Diffusers ☛ YamerMIX NihilMania".
. In fact, from what I can tell, once the metadata has been read out of the file, the byte order is not remembered at all -- it is used to read the directories (though not as thoroughly as one might hope, hence this bug!!) but then it is discarded.
It would be nice if the directory at least exposed (publicly) the byte order that was used when it was being read, because then I would at least have a red flag to look for: if I'm getting a Unicode string back from GetDescription but the directory's byte order is little-endian, I could manually decode it again (i.e., work around the bug). But as things sit right now, I can't even do that... I either have to manually re-read the file to discern the byte order, or I have to choose for my application to support ONLY big-endian or little-endian Unicode text encoding. :-( :-(
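One purely consumer-side possibility (a sketch under the assumption that the only failure mode is UTF-16 decoded with the wrong endianness, and that nothing is lost in the round trip) is to re-swap a suspicious description:

```csharp
using System.Text;

static class Utf16ByteSwap
{
    // Re-encode with the endianness the library assumed (big-endian), then decode
    // with the opposite one; this turns "䐀椀昀昀..." back into "Diff...". It is lossy
    // if the garbled string contains unpaired surrogates, so it is a heuristic rescue,
    // not a general fix.
    public static string Swap(string garbled)
    {
        var bytes = Encoding.BigEndianUnicode.GetBytes(garbled);
        return Encoding.Unicode.GetString(bytes);
    }
}
```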
Apologies if I'm missing something and there is a workaround I am not seeing!
After studying the code some more, I struck out on the fixes and workarounds I had in mind... that code for reading & initializing directory structures looks pretty particular, and I did not relish the thought of changing anything in order to get access to the byte order information in more places (although in my opinion, that would still be the best solution).
Obviously, I was initially dismissive about the idea of a heuristic to detect endianness. However, having read a little more, I realized I was conflating the general problem of detecting an arbitrary text encoding with the much narrower problem of telling UTF-16BE from UTF-16LE. Several approaches work fairly reliably for the latter (though there are probably still cases they won't handle perfectly).
I searched for a few minutes and didn't see an obvious one online, so I ended up implementing one myself. It's a little brute-force, but it works pretty well in my testing, both for the unit test I posted above and in the third-party software through which I became acquainted with metadata-extractor. I just added a couple of private static methods on the TagDescriptor class that implement detection of little-endian text, and changed a couple of lines to invoke that code when necessary. I would be happy to submit a pull request if you're interested.
at TagDescriptor.cs line 363:

```csharp
var idCode = Encoding.UTF8.GetString(commentBytes, 0, 8).TrimEnd('\0', ' ');
if (idCode == "UNICODE" && DetectLittleEndian(commentBytes, 8))
{
    encodingMap["UNICODE"] = Encoding.Unicode; // swap for little endian instead
}
if (encodingMap.TryGetValue(idCode, out var encoding))
{
```
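DetectLittleEndian itself isn't shown above; the following is only my own minimal sketch of what such a check might look like (counting which byte of each UTF-16 code unit is zero, which works well for mostly Latin text), not necessarily the implementation the previous comment describes:

```csharp
// A hedged sketch, not the implementation referred to above: for mostly ASCII/Latin
// text, UTF-16LE puts the zero byte second in each code unit and UTF-16BE puts it
// first, so counting where the zeros fall gives a reasonable (not infallible) guess.
private static bool DetectLittleEndian(byte[] bytes, int offset)
{
    int leVotes = 0, beVotes = 0;

    for (int i = offset; i + 1 < bytes.Length; i += 2)
    {
        if (bytes[i + 1] == 0 && bytes[i] != 0)
            leVotes++; // e.g. 48 00 -> 'H' when read little-endian
        else if (bytes[i] == 0 && bytes[i + 1] != 0)
            beVotes++; // e.g. 00 48 -> 'H' when read big-endian
    }

    return leVotes > beVotes;
}
```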
It has been a while, but I remember looking at moving the decoding of the tags that use the GetEncodedTextDescription method from the descriptors to the CustomProcessTag method in ExifTiffHandler. It would be a fairly simple fix.
These tags currently use the GetEncodedTextDescription method:
- ExifDirectoryBase.TagUserComment
- GpsDirectory.TagProcessingMethod
- GpsDirectory.TagAreaInformation
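A rough sketch of how that could look (hypothetical code, not the library's actual CustomProcessTag signature; the isMotorolaByteOrder parameter stands in for the byte order the handler would have in scope):

```csharp
using System.Text;

static class EncodedTextDecoding
{
    // Hypothetical helper: with decoding done in the TIFF handler rather than the
    // descriptor, the reader's byte order is available and can drive the UTF-16 choice.
    public static string DecodeUserComment(byte[] commentBytes, bool isMotorolaByteOrder)
    {
        var idCode = Encoding.ASCII.GetString(commentBytes, 0, 8).TrimEnd('\0', ' ');

        if (idCode == "UNICODE")
        {
            var utf16 = isMotorolaByteOrder ? Encoding.BigEndianUnicode : Encoding.Unicode;
            return utf16.GetString(commentBytes, 8, commentBytes.Length - 8).TrimEnd('\0');
        }

        // ASCII / JIS / undefined ID codes would keep their existing handling.
        return Encoding.UTF8.GetString(commentBytes, 8, commentBytes.Length - 8).TrimEnd('\0');
    }
}
```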