Wrong charset when CodedCharacterSet=ESC - A

Question

kenwa opened this issue 2 years ago · 3 comments

According to https://en.wikipedia.org/wiki/ISO/IEC_2022#cite_note-14.3.2-90, ISO-8859-1 should be used when CodedCharacterSet is any of the following:

ESC % A
ESC . A
ESC - A

Currently, only the first two sequences are supported.

The fix seems to be as simple as adding a new constant to Iso2022Converter

private static final byte MINUS_SIGN = 0x2D;

and adding an extra if clause to com.drew.metadata.iptc.Iso2022Converter#convertISO2022CharsetToJavaCharset

if (bytes.length > 2 && bytes[0] == ESC && bytes[1] == MINUS_SIGN && bytes[2] == LATIN_CAPITAL_A) return ISO_8859_1;

Iso2022ConverterTest.java should also be extended with

assertEquals("ISO-8859-1", Iso2022Converter.convertISO2022CharsetToJavaCharset(new byte[]{0x1B, (byte)0x2D, (byte)0x41}));

A pull request has been created: #615
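
To make the proposed check easier to try in isolation, here is a minimal, self-contained sketch. It is not the library's actual code: the class Iso2022Sketch and the method toJavaCharset are hypothetical names, the byte constants simply mirror the ones named above, and only the three ISO-8859-1 sequences from this issue are handled. Everything else returns null here, which is an assumption of the sketch rather than a statement about what Iso2022Converter does.

```java
// Hypothetical stand-in for the check described in this issue, not the real Iso2022Converter.
class Iso2022Sketch {
    private static final byte ESC = 0x1B;
    private static final byte PERCENT_SIGN = 0x25;
    private static final byte MINUS_SIGN = 0x2D; // the constant proposed in this issue
    private static final byte DOT = 0x2E;
    private static final byte LATIN_CAPITAL_A = 0x41;

    // Returns "ISO-8859-1" for ESC % A, ESC . A and ESC - A; null for anything else (sketch-only behaviour).
    static String toJavaCharset(byte[] bytes) {
        if (bytes.length > 2 && bytes[0] == ESC && bytes[2] == LATIN_CAPITAL_A) {
            byte intermediate = bytes[1];
            if (intermediate == PERCENT_SIGN || intermediate == DOT || intermediate == MINUS_SIGN) {
                return "ISO-8859-1";
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // ESC - A, the sequence this issue is about
        System.out.println(toJavaCharset(new byte[]{0x1B, 0x2D, 0x41}));
    }
}
```

Checking bytes[1] against the three intermediate bytes in one place keeps the sketch short; the actual change in the pull request adds a separate if clause, as shown above.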

Answer 1 · 2023-05-12T03:05:05.000Z

Thanks for the bug report and for the PR to fix it.

Are you able to share an image that reproduces this issue, so that we can add it to the public regression test data set?

Answer 2 · 2023-05-12T12:28:49.000Z

Sure! Due to access rights I cannot share the image where I found the issue, but I have created an image with a similar problem.

The image has CodedCharacterSet=ESC - A and a headline containing some French characters: Headline=l'Affiche présentait étaient.
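
To illustrate why the charset matters for this headline, here is a small sketch using only the Java standard library. The class name HeadlineDecodingSketch is hypothetical, and UTF-8 is used purely as an example of a wrong guess; what metadata-extractor actually falls back to when the escape sequence is not recognised is not shown here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: the same headline bytes decoded with the right and a wrong charset.
class HeadlineDecodingSketch {
    // \u00e9 is the accented e, written as an escape so the source file encoding does not matter.
    static final String HEADLINE = "l'Affiche pr\u00e9sentait \u00e9taient";

    // The bytes as they would be stored when the text is encoded as ISO-8859-1.
    static byte[] storedBytes() {
        return HEADLINE.getBytes(StandardCharsets.ISO_8859_1);
    }

    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        byte[] raw = storedBytes();
        // Correct charset: the accented characters survive the round trip.
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1).equals(HEADLINE));
        // Wrong charset: each lone 0xE9 byte is malformed UTF-8 and becomes U+FFFD.
        System.out.println(decode(raw, StandardCharsets.UTF_8).contains("\uFFFD"));
    }
}
```

Both lines print true: the headline round-trips with ISO-8859-1, while the UTF-8 decode contains replacement characters in place of the accented letters.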

Answer 3 · 2023-05-22T05:30:25.000Z

Thanks very much! I ported your fix to the .NET implementation in drewnoakes/metadata-extractor-dotnet#335 and added your sample image in drewnoakes/metadata-extractor-images@31209ed.