Wrong charset when CodedCharacterSet=ESC - A
kenwa opened this issue · 3 comments
According to https://en.wikipedia.org/wiki/ISO/IEC_2022#cite_note-14.3.2-90, ISO-8859-1 should be used when CodedCharacterSet is any of
- ESC % A
- ESC . A
- ESC - A
Currently, only the first two syntaxes are supported.
The fix seems to be as simple as adding a new constant to Iso2022Converter
private static final byte MINUS_SIGN = 0x2D;
and adding an extra if clause to com.drew.metadata.iptc.Iso2022Converter#convertISO2022CharsetToJavaCharset
if (bytes.length > 2 && bytes[0] == ESC && bytes[1] == MINUS_SIGN && bytes[2] == LATIN_CAPITAL_A) return ISO_8859_1;
The Iso2022ConverterTest.java should also be extended with
assertEquals("ISO-8859-1", Iso2022Converter.convertISO2022CharsetToJavaCharset(new byte[]{0x1B, (byte)0x2D, (byte)0x41}));
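Taken together, the proposed constant and check can be sketched in isolation as below. This is only an illustration of the ESC - A branch: the class name Iso2022Sketch and the method name charsetForEscMinusA are hypothetical, and the existing branches of Iso2022Converter are deliberately left out rather than guessed at.

```java
// Hypothetical stand-alone sketch; not the library's actual Iso2022Converter.
final class Iso2022Sketch {
    private static final byte ESC = 0x1B;
    private static final byte MINUS_SIGN = 0x2D;      // the new constant proposed above
    private static final byte LATIN_CAPITAL_A = 0x41;

    private Iso2022Sketch() {}

    /**
     * Returns "ISO-8859-1" when the bytes start with ESC - A (0x1B 0x2D 0x41),
     * and null otherwise. The other escape sequences handled by the real
     * converter are intentionally omitted from this sketch.
     */
    static String charsetForEscMinusA(byte[] bytes) {
        if (bytes.length > 2
                && bytes[0] == ESC
                && bytes[1] == MINUS_SIGN
                && bytes[2] == LATIN_CAPITAL_A) {
            return "ISO-8859-1";
        }
        return null;
    }
}
```

Calling Iso2022Sketch.charsetForEscMinusA(new byte[]{0x1B, 0x2D, 0x41}) returns "ISO-8859-1", mirroring the assertion proposed for Iso2022ConverterTest.java, while any shorter or different sequence returns null in this sketch.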
A pull request has been created: #615
Thanks for the bug report and for the PR to fix it.
Are you able to share an image that reproduces this issue, so that we can add it to the public regression test data set?
Thanks very much! I ported your fix to the .NET implementation in drewnoakes/metadata-extractor-dotnet#335 and added your sample image in drewnoakes/metadata-extractor-images@31209ed.