ajrcarey/pdfium-render

PdfPageTextObject.chars() returns wrong results for text objects with overlapping bounding boxes

cemerick opened this issue · 10 comments

I'd like to use pdfium-render to access all "primitive" elements (characters, paths, images) in the order that they are rendered, so that I can determine visibility for each such element (accounting for occlusion of primitives rendered earlier due to simple obstruction, clipping paths, etc).

I figured that I would be able to do this by iterating through PdfPage.objects(), and within that, iterating through each PdfPageTextObject.chars(). However, the latter doesn't retrieve individual chars specifically associated with a given text object; rather, it grounds out in a bounding-box search:

pub fn chars_for_object(
    &self,
    object: &PdfPageTextObject,
) -> Result<PdfPageTextChars, PdfiumError> {
    self.chars_inside_rect(object.bounds()?)
        .map_err(|_| PdfiumError::NoCharsInPageObject)
}

Of course, this doesn't reflect original rendering order at all, and ironically will result in the same character being visited multiple times, in the case of overlapping text objects.
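To make the double-visiting concrete, here is a minimal sketch of a geometry-only query. Everything in it (`Rect`, `Char`, `layout`, `object_bounds`, and this stand-in `chars_inside_rect`) is a hypothetical model written for illustration, not pdfium-render's API, and the fixed 6x10 glyph cells are an assumption.

```rust
// Hypothetical, simplified model of the problem. None of these types or
// functions are pdfium-render's; they only mimic a geometry-only query.

#[derive(Clone, Copy, Debug, PartialEq)]
struct Rect {
    left: f32,
    bottom: f32,
    right: f32,
    top: f32,
}

impl Rect {
    /// True if `other` lies entirely inside `self` (edges may touch).
    fn contains(&self, other: &Rect) -> bool {
        other.left >= self.left
            && other.right <= self.right
            && other.bottom >= self.bottom
            && other.top <= self.top
    }
}

#[derive(Clone, Copy, Debug, PartialEq)]
struct Char {
    glyph: char,
    owner: usize, // index of the text object that actually drew this glyph
    bounds: Rect,
}

/// Lays out `text` left to right in assumed 6x10 cells, starting at `x0`.
fn layout(text: &str, owner: usize, x0: f32) -> Vec<Char> {
    text.chars()
        .enumerate()
        .map(|(i, glyph)| {
            let left = x0 + 6.0 * i as f32;
            let bounds = Rect { left, bottom: 0.0, right: left + 6.0, top: 10.0 };
            Char { glyph, owner, bounds }
        })
        .collect()
}

/// Union of the glyph boxes that belong to `owner`.
fn object_bounds(chars: &[Char], owner: usize) -> Rect {
    chars
        .iter()
        .filter(|c| c.owner == owner)
        .map(|c| c.bounds)
        .reduce(|a, b| Rect {
            left: a.left.min(b.left),
            bottom: a.bottom.min(b.bottom),
            right: a.right.max(b.right),
            top: a.top.max(b.top),
        })
        .expect("owner has at least one glyph")
}

/// Geometry-only lookup: it cannot tell which object a glyph came from.
fn chars_inside_rect(chars: &[Char], rect: &Rect) -> Vec<Char> {
    chars.iter().copied().filter(|c| rect.contains(&c.bounds)).collect()
}

fn main() {
    // Object 1 ("BBB") sits entirely inside the bounds of object 0 ("AAAAAA").
    let mut page = layout("AAAAAA", 0, 0.0);
    page.extend(layout("BBB", 1, 6.0));
    for owner in 0..2 {
        let hits = chars_inside_rect(&page, &object_bounds(&page, owner));
        let text: String = hits.iter().map(|c| c.glyph).collect();
        println!("object {owner}: {text}");
    }
}
```

In this model the query for either object also returns glyphs owned by the other one, so iterating object by object visits the same glyphs more than once.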

Is there a way to access primitives, down to the character level, in rendered order (or with a render-order property if direct iteration isn't possible)?

(Thanks so much for this library, the work is greatly appreciated. 🙇)

Hi @cemerick, thanks for this interesting question.

Pdfium doesn't expose any information about the order in which it renders page objects. For the purposes of experimentation, I guess we can start with the assumption that the rendering order of page objects is the same as the iteration order of those page objects. That is just an assumption, however (although I will check to see if there's anything in the PDF standard about this).

If that assumption holds, then yes, you should be able to iterate through all page objects in order and assume that later page objects are rendered "on top" of earlier page objects.

I'm not understanding why you are interested in each individual character inside a text object. I would have thought the text object itself is the rendering primitive, not the characters within it - that is, there is no way a (e.g.) path object could occlude some characters in a text object, but itself be occluded by other characters in the same text object. Either all the characters in the text object would be "behind" the path, or they'd all be "in front of" the path.

Can you explain a little more why you're interested in each individual character? Do you have some test documents you can share that have interesting examples of occlusion that you're trying to detect?

> Pdfium doesn't expose any information about the order in which it renders page objects. For the purposes of experimentation, I guess we can start with the assumption that the rendering order of page objects is the same as the iteration order of those page objects. That is just an assumption, however (although I will check to see if there's anything in the PDF standard about this).

I'm still murky on the abstractions that pdfium provides ("page objects" in the PDF spec are specifically related to the page tree, not individuated rendered elements), but insofar as e.g. PdfPageTextObjects are style-coherent runs of characters (probably corresponding to individual TJ or Tj PDF operations), those runs of text are definitely rendered in the order in which they are encoded in the source document. This seems to be basically confirmed at https://groups.google.com/g/pdfium/c/Y5TBNRriJHk/m/UA9fZXszBQAJ (which further calls out appearance streams as not conforming to this intuition, but luckily I don't care much about them, at least for now).

> I'm not understanding why you are interested in each individual character inside a text object. I would have thought text object itself is the rendering primitive, not the characters within it - that is, there is no way a (e.g.) path object could occlude some characters in a text object, but itself be occluded by other characters in the same text object. Either all the characters in the text object would be "behind" the path, or they'd all be "in front of" the path.

For example, say you have a text object consisting of the characters "0123456789", and a filled rect is positioned to overlap chars 5-9, how are you going to determine that programmatically? Character-level bounding boxes are necessary, but as I said, PdfPageTextObject.chars() doesn't reliably yield the actual characters that constitute a text object, due to it doing a separate window query (which will scoop up characters that are part of other text objects, if their respective bounding boxes happen to be within that of the text object).
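A minimal sketch of that "0123456789" case, assuming fixed-advance glyph cells. `Rect`, `char_boxes`, and `occluded_chars` are hypothetical helpers written for illustration, not part of pdfium-render; in practice the per-character boxes would have to come from the library.

```rust
// Hypothetical model: per-character boxes let us say *which* glyphs a filled
// rectangle covers; the text object's own box can only say "some overlap".

#[derive(Clone, Copy, Debug, PartialEq)]
struct Rect {
    left: f32,
    bottom: f32,
    right: f32,
    top: f32,
}

impl Rect {
    /// True if the interiors overlap; boxes that merely touch do not count.
    fn intersects(&self, other: &Rect) -> bool {
        self.left < other.right
            && other.left < self.right
            && self.bottom < other.top
            && other.bottom < self.top
    }
}

/// Assumed layout: every glyph occupies an `advance` x `height` cell,
/// starting at x = 0 on a baseline of y = 0.
fn char_boxes(text: &str, advance: f32, height: f32) -> Vec<(char, Rect)> {
    text.chars()
        .enumerate()
        .map(|(i, glyph)| {
            let left = advance * i as f32;
            (glyph, Rect { left, bottom: 0.0, right: left + advance, top: height })
        })
        .collect()
}

/// Glyphs whose boxes intersect `cover`, in text order.
fn occluded_chars(chars: &[(char, Rect)], cover: &Rect) -> String {
    chars
        .iter()
        .filter(|(_, bounds)| bounds.intersects(cover))
        .map(|(glyph, _)| *glyph)
        .collect()
}

fn main() {
    let chars = char_boxes("0123456789", 6.0, 10.0);
    // A filled rect drawn later that starts where glyph '5' starts.
    let cover = Rect { left: 30.0, bottom: -2.0, right: 62.0, top: 12.0 };
    println!("covered: {}", occluded_chars(&chars, &cover));
}
```

This only tests box intersection; partial coverage, clipping paths, and transparency would need more than this sketch does.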

I hope this is clarifying! Although, given what I've read since my first message in the pdfium sources and google group (e.g. https://groups.google.com/g/pdfium/c/qivGc4X2r2E/m/1TWKF1tJBgAJ), I'm not optimistic that what I'm after is a reasonable objective with pdfium, at least not without some enhancements to it. Of course, if you find that I'm being too pessimistic, I'll be all ears. 😃

Ok, I see what you're getting at. Am I right in thinking that the fundamental problem here is that PdfPageTextObject::chars() (which uses PdfPageText::chars_for_object() under the hood) doesn't guarantee that it only returns the characters in the given text object, but (potentially) any characters in the bounding box area of the given text object? (As you say, there may be overlapping text boxes.)

If I'm right in thinking that's the fundamental problem, then let's pretend for a moment that that problem could be solved, such that PdfPageTextObject::chars() did return only the characters that were actually in the given text object. In that case, would I be right in thinking that you could use an approach along the lines of https://github.com/ajrcarey/pdfium-render/blob/master/examples/chars.rs to get the actual bounding boxes of the individual characters and, from there, perform your occlusion detection?

The reason I'm asking is that I think there probably is a way to work around the limitations of PdfPageTextObject::chars(), but I'll warn you now it's ... convoluted.

(You're right in thinking that Pdfium itself does not expose the exact functionality you ideally need here - just the ability to return all characters within a given bounding box, whether they overlap or not.)

> In that case, would I be right in thinking that you could use an approach along the lines of https://github.com/ajrcarey/pdfium-render/blob/master/examples/chars.rs to get the actual bounding boxes of the individual characters and, from there, perform your occlusion detection?

I mean, PdfPageTextChar offers bounding box accessors, so yes, once I have a handle on the actual char structs that comprise a text object, then handling occlusion or not is straightforward (modulo complex clipping paths, but that's my problem, etc 😄).

> I'll warn you now it's ... convoluted

That's a great motto for working with PDFs in general! 🤷 😆

> That's a great motto for working with PDFs in general!

Yeah, you're not wrong :)

Given that text overlap is the primary problem, we need to remove the possibility of overlap. There are two options that come to mind:

  1. As you're iterating over page objects, each time you come to a text object, move it to some off-page position where it won't be overlapped/overlapping anything. Then .chars_for_object() will return results just for that page object. You'll need to un-translate each character bounding box by the inverse of whatever you translated the page object by to move it into a non-overlapped position, but it should work.
  2. If you want to avoid moving objects around on the page, then as you're iterating over page objects, each time you come to a text object, create a new PdfPageGroupObject containing just that text object, copy the group onto a new page (so that the target page object - or, rather, a copy of it - will be the only object on the new page, and therefore guaranteed not to be overlapped/overlapping anything), use .chars_for_object() on the copy to get bounding boxes for each character, then delete the newly created page once you're done.

Both have pros and cons. With option 1, you need to do some manual translation and un-translation, which is a bit cumbersome. Option 2 avoids this (the copied object will be at the same position on the new page as it was on the original page), but there are some limitations when copying objects in Pdfium (as detailed in #60).
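The bookkeeping for option 1 can be sketched in a small model. This is an illustration under stated assumptions, not pdfium-render code: `TextObject`, `regenerate_text_index`, and `isolated_chars` are hypothetical names, glyphs sit in assumed 6x10 cells, and the offset is assumed to land the object somewhere no other object reaches.

```rust
// Hypothetical model of "move off-page, query, move back, un-translate".

#[derive(Clone, Copy, Debug, PartialEq)]
struct Rect {
    left: f32,
    bottom: f32,
    right: f32,
    top: f32,
}

impl Rect {
    fn translated(self, dx: f32, dy: f32) -> Rect {
        Rect {
            left: self.left + dx,
            bottom: self.bottom + dy,
            right: self.right + dx,
            top: self.top + dy,
        }
    }

    fn union(self, other: Rect) -> Rect {
        Rect {
            left: self.left.min(other.left),
            bottom: self.bottom.min(other.bottom),
            right: self.right.max(other.right),
            top: self.top.max(other.top),
        }
    }

    fn contains(&self, other: &Rect) -> bool {
        other.left >= self.left
            && other.right <= self.right
            && other.bottom >= self.bottom
            && other.top <= self.top
    }
}

#[derive(Clone, Debug, PartialEq)]
struct TextObject {
    glyphs: Vec<(char, Rect)>,
}

impl TextObject {
    /// Assumed layout: 6x10 cells starting at `x0`.
    fn new(text: &str, x0: f32) -> Self {
        let glyphs = text
            .chars()
            .enumerate()
            .map(|(i, g)| {
                let left = x0 + 6.0 * i as f32;
                (g, Rect { left, bottom: 0.0, right: left + 6.0, top: 10.0 })
            })
            .collect();
        TextObject { glyphs }
    }

    fn translate(&mut self, dx: f32, dy: f32) {
        for (_, r) in &mut self.glyphs {
            *r = r.translated(dx, dy);
        }
    }

    fn bounds(&self) -> Rect {
        self.glyphs.iter().map(|(_, r)| *r).reduce(Rect::union).expect("non-empty text object")
    }
}

/// Stand-in for rebuilding the page's text index after an object has moved.
fn regenerate_text_index(objects: &[TextObject]) -> Vec<(char, Rect)> {
    objects.iter().flat_map(|o| o.glyphs.iter().copied()).collect()
}

/// Returns the glyphs of `objects[index]` with boxes in original page coordinates.
fn isolated_chars(objects: &mut [TextObject], index: usize, dx: f32, dy: f32) -> Vec<(char, Rect)> {
    objects[index].translate(dx, dy); // 1. move the object off-page
    let text_index = regenerate_text_index(&*objects); // 2. rebuild the index
    let query = objects[index].bounds();
    let hits: Vec<(char, Rect)> =
        text_index.into_iter().filter(|(_, r)| query.contains(r)).collect(); // 3. bounded query
    objects[index].translate(-dx, -dy); // 4. put the object back
    hits.into_iter().map(|(g, r)| (g, r.translated(-dx, -dy))).collect() // 5. un-translate
}

fn main() {
    // Two text objects at exactly the same position.
    let mut objects = vec![TextObject::new("AAAAAA", 0.0), TextObject::new("BBBBBB", 0.0)];
    let hits = isolated_chars(&mut objects, 1, 10_000.0, 0.0);
    let text: String = hits.iter().map(|(g, _)| *g).collect();
    println!("{text}");
}
```

Step 2 runs once per text object here, which is where the regeneration cost discussed in this thread comes from.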

Option 1 is probably also a bit more efficient if you're processing thousands or millions of objects. (EDIT: actually, on reflection I'm not sure about this: the FPDFPageText needs to be regenerated each time an object moves in order for .chars_for_object() to work, and this is likely much faster on a newly created page containing just a single object than on an existing page containing lots of objects.)

I do consider this to be a bug in pdfium-render - the PdfPageTextObject::chars() function is advertising functionality that, in the case of overlapping text objects, isn't correct - so I am happy to work on this and I would probably look to follow option 1 to start with. However, I am away for much of August so won't make much progress before the end of the month. If you wanted to play around with either of these two options in the meantime - or come up with some other crazy scheme to get the right result - go for it!

Option 1 there does work, with some caveats:

  1. chars_for_object is really flawed, to the point of reliably producing duplicate char entries when rotated characters are involved. The workaround I have at the moment is to always translate text objects well off of the page's bounds, and then do a brute-force filter over all characters returned by PdfPageText, keeping those that are past the off-page threshold value.
  2. That workaround, plus obtaining a new "regenerated" PdfPageText after translating each text object (as you pointed out in your EDIT parenthetical), yields some pretty poor performance.

Thank you very much for the creative pointer re: the translation trick. (I rarely think of using a mutable document model, so I'm slightly embarrassed that I didn't think of it!) I'll continue to tinker with other "creative" options to avoid the performance problems, perhaps translating every text object into deterministic off-page space. Seeing the translation trick basically working, I'm left really confused as to how the link between text objects and PdfPageTextChars works; obviously a translation applied to the former does filter down to the latter, somehow. (Not actually expecting an answer from you, just thinking out loud.) I wonder if some spelunking in the FPDF* APIs might be worthwhile. 😬

(EDIT: I see now that the text objects are the basal representation in pdfium, and that char-level data is a second-order artifact via FPDF_TEXTPAGE, etc.)

A final (I think?) update from me:

Performance has now exceeded my expectations, given:

  1. I now translate all text objects on each page before obtaining a PdfPageText, once. Some bookkeeping is required to ensure that the text objects are shifted deterministically so each of them lands in its own "column" of horizontal space, but it works.
  2. Rather than using chars_for_object(), or any kind of brute force search over all characters for each text object, I index all character structs with an interval tree (using the chars' x coordinates). Now obtaining chars for all text objects is probably something like O(n log(n)) in aggregate, where it was probably something like O(n^3) before.
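A rough sketch of the column idea, with one loudly stated substitution: it uses a sorted index plus binary search rather than an interval tree, which is enough in this simplified model because the columns are disjoint. `assign_columns` and `chars_by_column` are hypothetical names, and a character is reduced to its left x coordinate; none of this is the code described above.

```rust
/// Gives each text object its own half-open x range [left, right),
/// laid out left to right from `start_x` with `gap` between columns.
fn assign_columns(widths: &[f32], start_x: f32, gap: f32) -> Vec<(f32, f32)> {
    let mut cursor = start_x;
    widths
        .iter()
        .map(|&width| {
            let column = (cursor, cursor + width);
            cursor += width + gap;
            column
        })
        .collect()
}

/// For every column, returns the indexes of the characters whose left x
/// coordinate falls inside it. One sort, then two binary searches per column.
fn chars_by_column(columns: &[(f32, f32)], char_xs: &[f32]) -> Vec<Vec<usize>> {
    // Sort character indexes by x once: O(n log n).
    let mut order: Vec<usize> = (0..char_xs.len()).collect();
    order.sort_by(|&a, &b| char_xs[a].total_cmp(&char_xs[b]));

    columns
        .iter()
        .map(|&(left, right)| {
            // First index with x >= left, and first index with x >= right.
            let start = order.partition_point(|&i| char_xs[i] < left);
            let end = order.partition_point(|&i| char_xs[i] < right);
            order[start..end].to_vec()
        })
        .collect()
}

fn main() {
    // Two text objects, 36 and 18 units wide, shifted to x >= 1000.
    let columns = assign_columns(&[36.0, 18.0], 1000.0, 10.0);
    // Left x of each character as reported after the shift, in arbitrary order.
    let char_xs = [1000.0, 1046.0, 1006.0, 1052.0, 1030.0, 1058.0];
    for (object, chars) in chars_by_column(&columns, &char_xs).iter().enumerate() {
        println!("object {object}: chars {chars:?}");
    }
}
```

Subtracting each column's shift from the matched characters' boxes would then recover their original page coordinates.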

I can't imagine that this kind of implementation would be a good addition to the library, or I'd suggest a PR; it works for my purposes, but I wouldn't think it a reasonable approach in general.

It might make for an interesting example, if you felt like sharing... up to you.

I will take a more general approach in PdfPageTextObject.

Initially I thought a simple check comparing the length of the text returned for the text object's bounding box against the text returned by calling FPDFTextObj_GetText() would be sufficient to determine the (hopefully rare) situation where the text object is overlapping another, but it's not that simple, as the following sample demonstrates:

use pdfium_render::prelude::*;

fn main() -> Result<(), PdfiumError> {
    let pdfium = Pdfium::new(Pdfium::bind_to_library(
        Pdfium::pdfium_platform_library_name_at_path("../pdfium/"),
    )?);

    // Create a new document with two overlapping text objects.

    let mut document = pdfium.create_new_pdf()?;

    let mut page = document
        .pages_mut()
        .create_page_at_start(PdfPagePaperSize::a4())?;

    let font = document.fonts_mut().times_roman();

    let txt1 = page.objects_mut().create_text_object(
        PdfPoints::ZERO,
        PdfPoints::ZERO,
        "AAAAAA",
        font,
        PdfPoints::new(10.0),
    )?;

    let txt2 = page.objects_mut().create_text_object(
        PdfPoints::ZERO,
        PdfPoints::ZERO,
        "BBBBBB",
        font,
        PdfPoints::new(10.0),
    )?;

    let page_text = page.text()?;

    println!("{}", page_text.all());

    if let Some(txt1) = txt1.as_text_object() {
        println!("{}", txt1.text());
        println!("{}", page_text.for_object(txt1));
        for (index, char) in txt1.chars(&page_text)?.iter().enumerate() {
            println!(
                "{}: {:?} ==? {:?}",
                index,
                txt1.text().chars().nth(index),
                char.unicode_string()
            );
        }
    }

    if let Some(txt2) = txt2.as_text_object() {
        println!("{}", txt2.text());
        println!("{}", page_text.for_object(txt2));
        for (index, char) in txt2.chars(&page_text)?.iter().enumerate() {
            println!(
                "{}: {:?} ==? {:?}",
                index,
                txt2.text().chars().nth(index),
                char.unicode_string()
            );
        }
    }

    Ok(())
}

A general solution to this probably requires always creating a temporary page containing nothing but the text object for which characters are being retrieved. Terrible for performance, obviously.

Adjusted PdfPageTextChars so it can take ownership over a temporary page used by a cloned object, if necessary. Confirmed test results now pass correctly for overlapping objects. Reworked test code above as unit tests. Ready to release as part of crate version 0.8.13.