Weird UTF-8 Characters when parsing hex string (keywords)
code-mage-com opened this issue · 3 comments
- PHP Version: 8.1.27
- PDFParser Version: 2.8.0
Description:
When parsing the attached PDF file, the extracted keywords contain garbled, CJK-looking characters such as "挀爀椀猀琀椀愀渀愀ⰰ 最攀猀豈ⰰ 甀漀ݠؐȁـڐȁذڐ۰ذذ۰ۀؐ݀ؐˀȁذ۰niglietti"
After debugging, I found that the cause is in ElementHexa::decode(): the $value parameter receives a hex string that is split across several lines of 80 characters each, like so:
feff007000610073007100750061002c0020007000720069006d00610076006500720061002c0020
0072006500730075007200720065007a0069006f006e0065002c0020006600650073007400610020
0063007200690073007400690061006e0061002c002000670065007300f9002c00200075006f0076
0061002000640069002000630069006f00630063006f006c006100740061002c00200063006f006e
00690067006c00690065007400740069002c002000700075006c00630069006e0069002c00200070
00610073007100750061006c0065002c002000630061006d00700061006e0065002c002000640069
006e006100200072006500620075006300630069002c00200075006f007600610020006400690020
007000610073007100750061002c0020
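To illustrate why the line breaks matter, here is a minimal sketch of what happens when a wrapped hex string is read in fixed-size chunks. It assumes the decoder steps through the string two characters at a time, as the byte-wise loop in decode() does. `hexPairs` is a hypothetical helper written for this illustration and is not part of PdfParser, and the sample strings are made up.

```php
<?php
// Hypothetical helper (not part of PdfParser): split a hex string into
// two-character chunks, the same step size the byte-wise loop uses.
function hexPairs(string $value): array
{
    return str_split($value, 2);
}

$clean   = "00200072";   // the digits for " " and "r" in UTF-16BE
$wrapped = "0020\n0072"; // same digits with a line break in the middle

print_r(hexPairs($clean));   // chunks: "00", "20", "00", "72"
print_r(hexPairs($wrapped)); // chunks: "00", "20", "\n0", "07", "2"
```

Once the newline is consumed as half of a chunk, every following chunk is shifted by one character, which would explain why the output turns to garbage right after each 80-character line.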
I got it working correctly by adding a preg_replace call before the initial length calculation, so that any carriage returns and newlines are removed before parsing:
public static function decode(string $value): string
{
    $text = '';
    // Remove line breaks so the fixed-size chunks below stay aligned.
    $value = preg_replace('#[\r\n]+#', '', $value);
    $length = \strlen($value);
    if ('00' === substr($value, 0, 2)) {
        for ($i = 0; $i < $length; $i += 4) {
            $hex = substr($value, $i, 4);
            $text .= '&#'.str_pad(hexdec($hex), 4, '0', \STR_PAD_LEFT).';';
        }
    } else {
        for ($i = 0; $i < $length; $i += 2) {
            $hex = substr($value, $i, 2);
            $text .= \chr(hexdec($hex));
        }
    }
    $text = html_entity_decode($text, \ENT_NOQUOTES, 'UTF-8');
    return $text;
}
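As a quick standalone check of the idea, the sketch below reproduces only the byte-wise branch with the newline stripping applied first. `decodeHexBytes` is a hypothetical function for illustration, not the library's decode(), and the input is a made-up single-byte string rather than the UTF-16 data from the attached PDF.

```php
<?php
// Sketch of the byte-wise branch only; not the library's full decode().
function decodeHexBytes(string $value): string
{
    // Drop CR/LF first so every two-character chunk is a real hex byte.
    $value = preg_replace('#[\r\n]+#', '', $value);

    $text = '';
    $length = strlen($value);
    for ($i = 0; $i < $length; $i += 2) {
        $text .= chr(hexdec(substr($value, $i, 2)));
    }

    return $text;
}

// "pasqua" as single-byte hex, wrapped after the first three bytes.
echo decodeHexBytes("706173\r\n717561"), "\n"; // prints "pasqua"
```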
PDF input
Expected output & actual output
Expected output:
pasqua, primavera, resurrezione, festa cristiana, gesù, uova di cioccolata, coniglietti, pulcini, pasquale, campane, dina rebucci, uova di pasqua,
Actual output (without the preg_replace line):
pasqua, primavera, � 倇〇倇 倇ꀆ逆倂쀂�怆倇〇䀆ဂ�挀爀椀猀琀椀愀渀愀Ⰰ 最攀猀豈Ⰰ 甀漀瘀ؐȀـڐȀذڐ۰ذذ۰ۀؐ݀ؐˀȀذ۰؎iglietti, pulcini, p�ဇ〇ဇ倆ဆ쀆倂쀂�〆ဆ퀇�ဆ倂쀂�䀆ऀ渀愀 爀攀戀甀挀挀椀Ⰰ 甀漀瘀愀 搀椀 ܀ؐܰܐݐؐˀȀ
Code
Nice catch! :D
I think it would be better to strictly whitelist only hexadecimal digits rather than just excluding newlines and carriage returns:
$value = preg_replace('/[^0-9a-f]/i', '', $value);
But both definitely fix the issue with this file.
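For comparison, here is a small sketch of how the two regexes behave on input that contains other whitespace as well. The function names `stripNewlines` and `keepHexDigits` are hypothetical, and the input string is invented for the example; the attached PDF only contains line breaks as far as this thread shows.

```php
<?php
// Two hypothetical normalizers mirroring the two regexes discussed above.
function stripNewlines(string $value): string
{
    return preg_replace('#[\r\n]+#', '', $value);
}

function keepHexDigits(string $value): string
{
    return preg_replace('/[^0-9a-f]/i', '', $value);
}

// Made-up input: a line break plus a space and a tab between digits.
$input = "00 70\n00\t61";

var_dump(stripNewlines($input)); // "00 7000\t61": space and tab survive
var_dump(keepHexDigits($input)); // "00700061": only hex digits remain
```

The PDF specification says that whitespace inside a hexadecimal string is ignored, so the whitelist version should also cover files that pad their hex strings with spaces or tabs.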
This should be a simple change, and if you can add your PDF to the test folder, you have a great file for a unit test. Do you want to create a PR for this fix?
@code-mage-com, may we add your test document meta1.pdf to the PdfParser repo so I can create a test case?