lebebr01/pdfsearch

Page numbers

Closed this issue · 3 comments

Hey Brandon, thanks for making this package, I've found it incredibly useful.

I'm not sure this is an issue, maybe more of a question. I was wondering how the page numbers are defined in found text. For example (using v0.3.0):

library(pdfsearch)
download.file('https://www.tescoplc.com/media/757589/tesco_annual_report_2021.pdf', destfile = 'tesco_2021.pdf')

keyword_search(
  'tesco_2021.pdf', 
  keyword = 'Tesco was built to be a champion for customers', 
  path = TRUE, 
  ignore_case = TRUE
)

# A tibble: 1 × 5
  keyword                                        page_num line_num line_text token_text
  <chr>                                             <int>    <int> <list>    <list>    
1 Tesco was built to be a champion for customers        2        4 <chr [1]> <list [1]>

The result returned here is page 2 - that's sort of correct, as the text is first found on the 2nd numbered page. But there are 2 pages before the numbering begins which I'd prefer were included in the page count (in my case because I want to point something like tabulizer at a specific location in the doc).

Just wondered if there was any way to change how page numbers are defined. Maybe this is already possible?

Anyways, thanks again for all your work with this. It's a terrific package that works a treat!

Alastair

I for one do think this is a genuine issue that should be fixed.

First, pdf files have a ‘physical’ page numbering, independent of whatever might or might not be displayed on any of the pages, and I tend to see this as the preferred number pdfsearch should use.

Second, pdf files may also (but not all do) contain ‘logical’ page numbering distinct from physical page numbering, e.g., front matter with pages numbered in lowercase roman followed by main content numbered in Arabic numerals, plus sometimes appendices with page numbers such as ‘A-1’. For correct citations one would usually need to give the ‘logical’ page numbering. pdfsearch could possibly provide an option to report these ‘logical’ page numbers rather than the ‘physical’ ones, but that’s not super high on my wish list, as ‘logical’ page numbers can be reliably inferred from physical ones if the numbering schema for a given document is known.
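As a rough illustration of inferring ‘logical’ from ‘physical’ page numbers, here is a minimal base R sketch. The helper `logical_page_label()` and its `n_front` argument are hypothetical and not part of pdfsearch. It assumes the simplest schema: `n_front` front-matter pages labelled in lowercase roman, followed by main content numbered from 1.

```r
# Hypothetical helper (not part of pdfsearch): map a physical page index to a
# logical page label. Assumes `n_front` front-matter pages labelled i, ii, ...
# followed by main content numbered 1, 2, 3, ...
logical_page_label <- function(physical, n_front = 0L) {
  stopifnot(is.numeric(physical), all(physical >= 1), n_front >= 0)
  is_front <- physical <= n_front
  # Main content: subtract the number of front-matter pages.
  label <- as.character(physical - n_front)
  # Front matter: lowercase roman numerals of the physical index.
  if (any(is_front)) {
    label[is_front] <- tolower(as.character(utils::as.roman(physical[is_front])))
  }
  label
}

logical_page_label(c(1, 2, 3, 6), n_front = 2)
```

With two front-matter pages, physical page 4 maps to logical page "2". A document whose front matter is unnumbered, or which has appendix labels, would need a different schema.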

Finally, of course, page numbers are often displayed on the pages themselves. It might be possible to analyse page content (headers, footers, …) to identify these (an option the OP might have been alluding to), and thereby provide logical page numbers if the pdf does not contain any, or to check and possibly override them if the pdf does. I currently don’t see a pressing need for pdfsearch to try and implement this option, unless it actually is implemented already, in which case it should be fixed.

pdfsearch, however, reports page numbers that seem to have nothing to do with any of these three options.

In my opinion, ‘physical’ pdf page numbers are what pdfsearch should report as a default, and it’d be great if pdfsearch could be fixed accordingly.

To illustrate this further, I tested with the tmap manual:

library(pdfsearch)
download.file('https://cran.r-project.org/web/packages/tmap/tmap.pdf', destfile = 'tmap.pdf')
keyword_search(
  'tmap.pdf', 
  keyword = 'annealing', 
  path = TRUE, 
  ignore_case = TRUE
)

Result:

# A tibble: 1 × 5
  keyword   page_num line_num line_text token_text
  <chr>        <int>    <int> <list>    <list>    
1 annealing       35     1229 <chr [1]> <list [1]>

The actual physical page number the string ‘annealing’ is found on, however, is not ‘35’ but ‘108’, and all pages except the first display page numbers in their headers. I am not certain what the ‘35’ is supposed to refer to, but it isn’t anything I can use, e.g., for citation purposes.
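One way to cross-check a physical page number independently of pdfsearch is to search the per-page text directly. The sketch below assumes the pdftools package is installed and relies on `pdftools::pdf_text()` returning one string per physical page. The helper `find_physical_pages()` is hypothetical and only does a plain substring match, so it will miss phrases that are split across lines or pages.

```r
# Hypothetical helper (not part of pdfsearch): given one string per physical
# page, return the indices of the pages that contain `keyword`.
find_physical_pages <- function(pages, keyword, ignore_case = TRUE) {
  if (ignore_case) {
    pages <- tolower(pages)
    keyword <- tolower(keyword)
  }
  # fixed = TRUE treats the keyword as a literal string, not a regex.
  which(grepl(keyword, pages, fixed = TRUE))
}

# Usage, assuming pdftools is available and 'tmap.pdf' was downloaded as above.
if (requireNamespace("pdftools", quietly = TRUE) && file.exists("tmap.pdf")) {
  pages <- pdftools::pdf_text("tmap.pdf")  # one element per physical page
  find_physical_pages(pages, "annealing")
}
```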

Thanks for submitting this. I agree, it does need some work. I hope to have a fix for this later this week.

The main version on GitHub should fix this issue. I'm leaving this open until I write a couple of tests to better confirm the fix. The fix reports the physical page number of the PDF. The package does not try to extract the page number written into the PDF.

library(pdfsearch)
keyword_search(
    'https://cran.r-project.org/web/packages/tmap/tmap.pdf', 
    keyword = 'anneal', 
    path = TRUE, 
    ignore_case = TRUE
)

or

keyword_search(
    'https://www.tescoplc.com/media/757589/tesco_annual_report_2021.pdf', 
    keyword = 'Tesco was built to be a champion for customers', 
    path = TRUE, 
    ignore_case = TRUE, remove_equations = FALSE
)

should return 108 and 4 respectively.