drewnoakes/metadata-extractor

Wrong charset when CodedCharacterSet=ESC - A

kenwa opened this issue · 3 comments

kenwa commented

According to https://en.wikipedia.org/wiki/ISO/IEC_2022#cite_note-14.3.2-90, ISO-8859-1 should be used when CodedCharacterSet is any of

  • ESC % A
  • ESC . A
  • ESC - A

Currently, only the first two syntaxes are supported.

The fix seems to be as simple as adding a new constant to Iso2022Converter

private static final byte MINUS_SIGN = 0x2D;

and adding an extra if clause to com.drew.metadata.iptc.Iso2022Converter#convertISO2022CharsetToJavaCharset

if (bytes.length > 2 && bytes[0] == ESC && bytes[1] == MINUS_SIGN && bytes[2] == LATIN_CAPITAL_A) return ISO_8859_1;
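For readers who want to try the check in isolation, the constant and the if clause above can be combined into a small standalone sketch. This is not the library's actual source: the class name `EscapeSequenceSketch` and the method name `charsetFor` are hypothetical, and only the `ESC - A` branch proposed in this issue is shown.

```java
// Standalone sketch of the proposed check; not the real Iso2022Converter.
// EscapeSequenceSketch and charsetFor are hypothetical names for illustration.
final class EscapeSequenceSketch {
    private static final byte ESC = 0x1B;
    private static final byte MINUS_SIGN = 0x2D;
    private static final byte LATIN_CAPITAL_A = 0x41;

    /** Returns "ISO-8859-1" for the byte sequence ESC - A, otherwise null. */
    static String charsetFor(byte[] bytes) {
        if (bytes.length > 2 && bytes[0] == ESC && bytes[1] == MINUS_SIGN && bytes[2] == LATIN_CAPITAL_A)
            return "ISO-8859-1";

        // Unknown or too-short sequence: let the caller decide on a fallback.
        return null;
    }
}
```

The length guard comes first so that a truncated CodedCharacterSet value cannot cause an out-of-bounds read.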

The Iso2022ConverterTest.java should also be extended with

assertEquals("ISO-8859-1", Iso2022Converter.convertISO2022CharsetToJavaCharset(new byte[]{0x1B, (byte)0x2D, (byte)0x41}));
A pull request has been created: #615

Thanks for the bug report and for the PR to fix it.

Are you able to share an image that reproduces this issue, so that we can add it to the public regression test data set?

kenwa commented

Sure! Due to access rights I cannot share the image where I found the issue, but I have created an image with a similar problem.

The image has CodedCharacterSet=ESC - A and a headline containing some French characters: Headline=l'Affiche présentait étaient.
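To illustrate why the charset matters for this headline, here is a minimal sketch that encodes the text as ISO-8859-1 and then decodes it with two different charsets. It does not use metadata-extractor at all. The class name `HeadlineDecodingSketch` is hypothetical, and using UTF-8 as the "wrong" charset is an assumption made purely for illustration, not a statement about what the library falls back to.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only shows the effect of decoding with the wrong charset.
final class HeadlineDecodingSketch {
    // The headline from the sample image, with é written as a Unicode escape
    // so the source file's own encoding does not matter.
    static final String HEADLINE = "l'Affiche pr\u00E9sentait \u00E9taient";

    /** The headline as it would be stored when encoded as ISO-8859-1. */
    static byte[] latin1Bytes() {
        return HEADLINE.getBytes(StandardCharsets.ISO_8859_1);
    }

    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        byte[] raw = latin1Bytes();
        // Correct charset: the accented characters survive.
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1));
        // Assumed wrong charset: each lone 0xE9 byte is malformed UTF-8
        // and is replaced with U+FFFD.
        System.out.println(decode(raw, StandardCharsets.UTF_8));
    }
}
```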


Thanks very much! I ported your fix to the .NET implementation in drewnoakes/metadata-extractor-dotnet#335 and added your sample image in drewnoakes/metadata-extractor-images@31209ed.