espresso3389/pdfrx

[Web] Text selection on web always starts at start of line

Opened this issue · 5 comments

As mentioned in Issue #4, text selection on Web has one remaining issue: it always selects complete lines.
To reproduce, try selecting a couple of words in the demo application: https://espresso3389.github.io/pdfrx/

Some more observations (which might be obvious to you).

I noticed the following when opening the same (two-page) PDF:

  • on Linux, PdfPageTextPdfium._loadText(...) created 581/292 fragments
  • on Web, PdfPageTextWeb._loadText(...) created only 72/43 fragments

When printing out the text of the resulting PdfPageTextFragment objects, I noticed that Pdfium seems to create fragments at the word level, while Web seems to create one fragment per line.

I guess this explains why a selection always starts at the beginning of the line.
Is there some way to get word-level fragments on Web as well?

You're right. I don't know how to extract word-level coordinates with pdf.js. The pdf.js example viewer can handle word-level coordinates, but it relies on something provided by the HTML canvas or similar. I need to do more research on that...
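
For illustration only, here is a rough sketch of one possible stopgap: splitting a line-level fragment into word-level fragments by assuming every character has the same width. The `Fragment` type and `splitIntoWords` function are hypothetical names invented for this example; they are not part of pdfrx or pdf.js, and the uniform-width assumption will be inaccurate for proportional fonts.

```typescript
// Hypothetical shape of a line-level fragment (not an actual pdfrx or pdf.js type).
interface Fragment {
  text: string;
  x: number; // left edge of the fragment
  y: number; // vertical position of the fragment
  width: number;
  height: number;
}

// Split one line-level fragment into word-level fragments.
// Assumption: all characters share the same width, so each word's horizontal
// extent is estimated from its character offset within the line.
function splitIntoWords(line: Fragment): Fragment[] {
  if (line.text.length === 0) return [];
  const charWidth = line.width / line.text.length;
  const words: Fragment[] = [];
  const pattern = /\S+/g; // runs of non-whitespace characters
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(line.text)) !== null) {
    words.push({
      text: match[0],
      x: line.x + match.index * charWidth,
      y: line.y,
      width: match[0].length * charWidth,
      height: line.height,
    });
  }
  return words;
}
```

A more accurate version would presumably measure each word with the actual font metrics instead of a uniform character width, which may be what the canvas-based approach mentioned above does.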

Any updates on the text selection feature for the web? There also seems to be a consistency issue when selecting text: for example, it sometimes misses certain words or skips some parts.

I just googled this and found the issue.

It explains which part of the code is dedicated to extracting text positions.

I'll read the code to learn how pdf.js handles text coordinates.

This is great news! Thanks for the heads up!