smalot/pdfparser

getDataTm() provides wrong coordinates for text blocks


I found an issue with the getDataTm() method in version 2.11. In some cases, the result contains text from a neighboring block instead of the block specified by the coordinates. The reason is that the PDFObject::getTextArray() method returns some text from a "Do" command at the location of certain xobjects:

$text[] = $xobject->getText($page);

Then, inside the getDataTm() method, strings from PDFObject::getTextArray() are matched with commands returned by the Page::getDataCommands() method:

$extractedTexts = $this->getTextArray();

$dataCommands = $this->getDataCommands();

However, the latter does not return the "Do" command, so there are more elements in PDFObject::getTextArray() than in Page::getDataCommands(), leading to a mismatch.
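
To make the mismatch concrete, here is a minimal sketch with made-up data. It does not call the library. It assumes that texts are consumed one per text-showing command, in order, which is a simplification of what getDataTm() does. The names pairByIndex, 'op' and 'pos' are hypothetical and exist only for this illustration.

```php
<?php
// Hypothetical stand-in data; NOT real output of smalot/pdfparser.
// Two text-showing commands, as a command list without "Do" would contain.
$dataCommands = [
    ['op' => 'Tj', 'pos' => [10, 700]],
    ['op' => 'Tj', 'pos' => [10, 680]],
];

// Three texts: the middle one comes from an xobject drawn by "Do".
$extractedTexts = ['Header', 'text from xobject', 'Body'];

/**
 * Pair each command with the text at the same index
 * (an assumed simplification of the matching step).
 */
function pairByIndex(array $texts, array $commands): array
{
    $pairs = [];
    foreach ($commands as $i => $command) {
        $pairs[] = [
            'pos'  => $command['pos'],
            'text' => $texts[$i] ?? null,
        ];
    }
    return $pairs;
}

$pairs = pairByIndex($extractedTexts, $dataCommands);

foreach ($pairs as $pair) {
    printf("(%d, %d) => %s\n", $pair['pos'][0], $pair['pos'][1], $pair['text']);
}
```

In this sketch the second position is paired with the xobject text instead of 'Body', and 'Body' is never paired with anything. That matches the symptom described above: text from a neighboring block shows up at the wrong coordinates.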

Unfortunately, I cannot provide a minimal PDF example. The files I have to parse are too large, and I don't know how they were generated. In my case, commenting out $text[] = $xobject->getText($page); helped. Since I'm not sure what the original intent of handling "Do" was, I cannot suggest a pull request that would fix this issue.

I also had this problem and worked around it in this if statement:

if (\is_object($xobject) && $xobject instanceof self && !\in_array($xobject->getUniqueId(), self::$recursionStack, true)) {

I changed it from

if (\is_object($xobject) && $xobject instanceof self && !\in_array($xobject->getUniqueId(), self::$recursionStack, true)) {
    // Not a circular reference.
    $text[] = $xobject->getText($page);
}

to

if (\is_object($xobject) && $xobject instanceof self && !\in_array($xobject->getUniqueId(), self::$recursionStack, true)) {
    // Not a circular reference.

    // Only add the text if the xobject actually produced any; otherwise the number of texts and TJ/Tj commands doesn't match and the last texts are ignored.
    $newText = $xobject->getText($page);
    if ($newText === ' ') {
        break;
    }
    $text[] = $newText;
}
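
The effect of that guard on the counts can be sketched with stand-in data. This is only an illustration, not library code: collectTexts, the 'op'/'text' keys and the sample items are all hypothetical. It assumes, as in the workaround, that an xobject without text yields a single space.

```php
<?php
/**
 * Collect texts from a simplified command stream (hypothetical format).
 * With $skipBlankXObjects enabled, a "Do" entry whose text is ' ' is dropped,
 * mirroring the guard in the workaround.
 */
function collectTexts(array $items, bool $skipBlankXObjects): array
{
    $texts = [];
    foreach ($items as $item) {
        if ($skipBlankXObjects && $item['op'] === 'Do' && $item['text'] === ' ') {
            continue; // blank xobject: do not add a text entry
        }
        $texts[] = $item['text'];
    }
    return $texts;
}

// Made-up stream: two text commands with a blank xobject in between.
$blank = [
    ['op' => 'Tj', 'text' => 'Header'],
    ['op' => 'Do', 'text' => ' '],
    ['op' => 'Tj', 'text' => 'Body'],
];

// Same stream, but the xobject carries real text.
$withText = $blank;
$withText[1]['text'] = 'text from xobject';

// Number of entries a command list without "Do" would have.
$textCommands = count(array_filter($blank, static function (array $item): bool {
    return $item['op'] !== 'Do';
}));

$unguarded   = count(collectTexts($blank, false));
$guarded     = count(collectTexts($blank, true));
$guardedText = count(collectTexts($withText, true));

printf("commands: %d, unguarded: %d, guarded: %d, guarded with xobject text: %d\n",
    $textCommands, $unguarded, $guarded, $guardedText);
```

Under these assumptions the guard brings the counts back in line only when the xobject is blank. If the xobject contains real text, the extra entry is still added, so the workaround would probably not cover that case.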

I didn't create a PR because I wasn't 100% sure whether this is the correct fix or just a dirty workaround. But maybe it helps someone with the same problem.