smalot/pdfparser

Weird UTF-8 Characters when parsing hex string (keywords)

code-mage-com opened this issue · 3 comments

  • PHP Version: 8.1.27
  • PDFParser Version: 2.8.0

Description:

When parsing the attached PDF file, the keywords include weird "Japanese" characters such as "挀爀椀猀琀椀愀渀愀ⰰ 最攀猀豈ⰰ 甀漀ݠؐȁـڐȁذڐ۰ذذ۰ۀؐ݀ؐˀȁذ۰niglietti"

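After debugging, I found that the cause is in ElementHexa::decode(): the value parameter receives a hex string that is wrapped at 80 characters per line. Because decode() steps through the string in fixed-size chunks, the embedded line breaks throw off the alignment of the digits that follow. The string looks like this: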

feff007000610073007100750061002c0020007000720069006d00610076006500720061002c0020
0072006500730075007200720065007a0069006f006e0065002c0020006600650073007400610020
0063007200690073007400690061006e0061002c002000670065007300f9002c00200075006f0076
0061002000640069002000630069006f00630063006f006c006100740061002c00200063006f006e
00690067006c00690065007400740069002c002000700075006c00630069006e0069002c00200070
00610073007100750061006c0065002c002000630061006d00700061006e0065002c002000640069
006e006100200072006500620075006300630069002c00200075006f007600610020006400690020
007000610073007100750061002c0020
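To illustrate the misalignment in isolation, here is a minimal sketch. `hexPairsToBytes()` is a hypothetical helper that only mirrors the two-digit loop shown in this issue; it is not the library's code. It relies on the documented behaviour that `hexdec()` ignores non-hexadecimal characters (with a deprecation notice since PHP 7.4, silenced here with `@`).

```php
<?php
// Hypothetical helper for illustration only: reads the input two characters
// at a time, like the pairwise branch of the loop discussed in this issue.
function hexPairsToBytes(string $hex): string
{
    $out = '';
    $length = strlen($hex);
    for ($i = 0; $i < $length; $i += 2) {
        // hexdec() skips characters that are not hex digits; "@" hides the
        // deprecation notice PHP raises when that happens.
        $out .= chr((int) @hexdec(substr($hex, $i, 2)));
    }

    return $out;
}

$clean   = '00700061';     // UTF-16BE code units for "pa"
$wrapped = "0070\n0061";   // the same digits with a line break in the middle

echo bin2hex(hexPairsToBytes($clean)), PHP_EOL;   // 00700061
echo bin2hex(hexPairsToBytes($wrapped)), PHP_EOL; // 0070000601
```

After the line break, every pair starts one hex digit too late, so the remaining bytes no longer match the original code units.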

I got it working correctly by adding a preg_replace call before the initial length calculation, which removes any carriage returns and newlines before parsing:

    public static function decode(string $value): string
    {
        $text = '';
        // Strip CR/LF so that a line-wrapped hex string is read as one run of digits.
        $value = preg_replace('#[\r\n]+#', '', $value);
        $length = \strlen($value);

        if ('00' === substr($value, 0, 2)) {
            for ($i = 0; $i < $length; $i += 4) {
                $hex = substr($value, $i, 4);
                $text .= '&#'.str_pad(hexdec($hex), 4, '0', \STR_PAD_LEFT).';';
            }
        } else {
            for ($i = 0; $i < $length; $i += 2) {
                $hex = substr($value, $i, 2);
                $text .= \chr(hexdec($hex));
            }
        }
        $text = html_entity_decode($text, \ENT_NOQUOTES, 'UTF-8');

        return $text;
    }

PDF input

meta1.pdf

Expected output & actual output

Expected output:

pasqua, primavera, resurrezione, festa cristiana, gesù, uova di cioccolata, coniglietti, pulcini, pasquale, campane, dina rebucci, uova di pasqua, 
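As a cross-check of the expected text, the first digits of the hex string above can be read as UTF-16BE with a leading byte order mark (`feff`). This is only a sketch: `utf16beHexToUtf8()` is a hypothetical helper, it assumes the mbstring extension is available, and it is not how PdfParser itself decodes the value.

```php
<?php
// Hypothetical helper for illustration only; assumes ext-mbstring is loaded.
function utf16beHexToUtf8(string $hex): string
{
    // Convert the hex digits to raw bytes.
    $bytes = hex2bin($hex);

    // Drop the UTF-16BE byte order mark (FE FF) if present.
    if (substr($bytes, 0, 2) === "\xFE\xFF") {
        $bytes = substr($bytes, 2);
    }

    return mb_convert_encoding($bytes, 'UTF-8', 'UTF-16BE');
}

// First 36 hex digits of the string quoted in this issue.
echo utf16beHexToUtf8('feff007000610073007100750061002c0020'), PHP_EOL; // pasqua, 
```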

Actual output (without the preg_replace line):

pasqua, primavera, � 倇〇倇  倇ꀆ逆倂쀂�怆倇〇䀆ဂ�挀爀椀猀琀椀愀渀愀Ⰰ 最攀猀豈Ⰰ 甀漀瘀ؐȀـڐȀذڐ۰ذذ۰ۀؐ݀ؐˀȀذ۰؎iglietti, pulcini, p�ဇ〇ဇ倆ဆ쀆倂쀂�〆ဆ퀇�ဆ倂쀂�䀆ऀ渀愀 爀攀戀甀挀挀椀Ⰰ 甀漀瘀愀 搀椀 ܀ؐܰܐݐؐˀȀ

Code

Nice catch! :D

I think it would be better to strictly whitelist only hexadecimal digits rather than just excluding newlines and carriage returns:

$value = preg_replace('/[^0-9a-f]/i', '', $value);

But both definitely fix the issue with this file.
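A small sketch of how the two patterns differ on input that contains more than line breaks. The sample string is made up for illustration; the PDF specification states that white-space characters inside a hex string are ignored, so spaces and tabs are plausible there as well.

```php
<?php
// Made-up sample: hex digits mixed with a space, CRLF and a tab.
$raw = "feff0070 0061\r\n\t0073";

// Pattern from the original report: removes only CR and LF.
$newlinesOnly = preg_replace('#[\r\n]+#', '', $raw);

// Whitelist pattern: removes everything that is not a hex digit.
$hexOnly = preg_replace('/[^0-9a-f]/i', '', $raw);

// The space and the tab survive the first pattern.
var_dump(preg_match('/^[0-9a-f]*$/i', $newlinesOnly) === 1); // bool(false)

// The whitelist leaves hex digits only.
echo $hexOnly, PHP_EOL; // feff007000610073
```

One trade-off of the whitelist is that it also drops any other unexpected character silently instead of surfacing it, which may or may not be desirable for malformed files.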

This should be a simple change, and if you can add your PDF to the test folder, you have a great file for a unit test. Do you want to create a PR for this fix?

@code-mage-com, may we add your test document meta1.pdf to the PdfParser repo so I can create a test case?