getDataTm() provides wrong coordinates for text blocks
I found an issue with the getDataTm() method in version 2.11. In some cases, the result contains text from a neighboring block instead of the block specified by the coordinates. The reason is that the PDFObject::getTextArray() method returns some text from a "Do" command at the location of certain xobjects:
pdfparser/src/Smalot/PdfParser/PDFObject.php
Line 785 in ac8e667
Then, inside the getDataTm() method, strings from PDFObject::getTextArray() are matched with commands returned by the Page::getDataCommands() method:
pdfparser/src/Smalot/PdfParser/Page.php
Line 730 in ac8e667
pdfparser/src/Smalot/PdfParser/Page.php
Line 685 in ac8e667
However, the latter does not return the "Do" command, so there are more elements in PDFObject::getTextArray() than in Page::getDataCommands(), leading to a mismatch.
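To make the mismatch concrete, here is a minimal, self-contained sketch of index-based pairing. It is a simplified model, not the library's actual code: the function `pairTextWithPositions()`, the command array shape, and the sample strings are all assumptions made for illustration.

```php
<?php
// Hypothetical, simplified model of the pairing step; not the library's real code.

/**
 * Pair each text-showing command (Tj) with the next extracted string, by index.
 *
 * @param array<int, array{o: string, c: string}> $commands simplified command list
 * @param array<int, string> $texts strings in extraction order
 * @return array<int, array{pos: string, text: string}>
 */
function pairTextWithPositions(array $commands, array $texts): array
{
    $result = [];
    $pos = '';
    $i = 0;
    foreach ($commands as $command) {
        if ($command['o'] === 'Tm') {
            // Remember the most recent text matrix as the current position.
            $pos = $command['c'];
        } elseif ($command['o'] === 'Tj' && isset($texts[$i])) {
            // Take the next string purely by index.
            $result[] = ['pos' => $pos, 'text' => $texts[$i]];
            ++$i;
        }
    }

    return $result;
}

// The command list has no entry for the "Do" command between the two blocks.
$commands = [
    ['o' => 'Tm', 'c' => '1 0 0 1 10 700'],
    ['o' => 'Tj', 'c' => '(Header)'],
    ['o' => 'Tm', 'c' => '1 0 0 1 10 650'],
    ['o' => 'Tj', 'c' => '(Body)'],
];

// The text list does include a string contributed by the XObject.
$texts = ['Header', 'Logo text', 'Body'];

$paired = pairTextWithPositions($commands, $texts);
```

In this model the XObject string takes the position that belongs to the second block, and the last string is never paired with anything, which matches the symptom described above.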
Unfortunately, I cannot provide a minimal PDF example. The files I have to parse are too large, and I don't know how they were generated. In my case, commenting out $text[] = $xobject->getText($page); helped. Since I'm not sure what the original intent of handling "Do" was, I cannot suggest a pull request that would fix this issue.
I also ran into this problem and made a workaround for myself in this if statement:
pdfparser/src/Smalot/PdfParser/PDFObject.php
Line 783 in ac8e667
I changed it from
if (\is_object($xobject) && $xobject instanceof self && !\in_array($xobject->getUniqueId(), self::$recursionStack, true)) {
    // Not a circular reference.
    $text[] = $xobject->getText($page);
}
to
if (\is_object($xobject) && $xobject instanceof self && !\in_array($xobject->getUniqueId(), self::$recursionStack, true)) {
    // Not a circular reference.
    // Only add the text if the XObject actually produced some; otherwise the number of texts and the number of TJ/Tj commands don't match, and the last texts are ignored.
    $newText = $xobject->getText($page);
    if ($newText === ' ') {
        break;
    }
    $text[] = $newText;
}
I didn't create a PR because I wasn't 100% sure whether this is the correct fix or just a dirty workaround. But maybe this can help someone with the same problem.
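The effect of the workaround can be sketched in isolation. This is a hypothetical model: `collectTexts()` and its input shape are made up for illustration and are not part of the library. It also uses `trim()` rather than the strict comparison with ' ' shown above, which is a slightly broader check.

```php
<?php
// Hypothetical sketch; collectTexts() and its input shape are assumptions, not library API.

/**
 * Collect strings in stream order, skipping XObjects whose extracted text is blank.
 *
 * @param array<int, array{source: string, text: string}> $items
 * @return array<int, string>
 */
function collectTexts(array $items): array
{
    $texts = [];
    foreach ($items as $item) {
        // An XObject without text of its own yields only whitespace, so skip it.
        if ($item['source'] === 'Do' && trim($item['text']) === '') {
            continue;
        }
        $texts[] = $item['text'];
    }

    return $texts;
}

$items = [
    ['source' => 'Tj', 'text' => 'Header'],
    ['source' => 'Do', 'text' => ' '],
    ['source' => 'Tj', 'text' => 'Body'],
];

// One string remains per Tj item, so index-based pairing lines up again.
$texts = collectTexts($items);
```

Note that this only covers XObjects that yield no text. If an XObject does contain text, its string is still added without a matching Tj/TJ entry in the command list, so the counts would presumably still diverge in that case.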