Print page's text when pdfxmeta command fails

Question

dalanicolai opened this issue 4 years ago · 6 comments

I have a document for which I am trying to define the levels of headings in the pdf using the pdfxmeta command.
However, as you can see from the example page here, extracted with mutool draw -F text out.pdf 1 on the command line:
out.pdf
for some reason the text extracted from that page by mupdf reads as follows:

'Chapter\n1\nIn\ntro\nduction\n... etc.'

That is, the text is broken up by newline characters. When I give a complete word to the pdfxmeta command, it does not find anything and fails to set a level. However, when I provide just the part of a word that sits between two newline characters (e.g. 'duction'), the pdftocgen command subsequently works just fine. Of course I only got here by debugging, but it would be great if the pdfxmeta command printed the page's text when it fails to find the given pattern (or otherwise maybe extend the documentation to mention the mutool draw -F text ... command).
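The failure can be reproduced without a PDF at all. The following is a minimal Python sketch, assuming the search is applied to each extracted fragment separately and that the page text looks like the abbreviated sample quoted above. The helper `find_pattern` is hypothetical and is not part of pdf.tocgen or mupdf.

```python
import re

# Abbreviated sample shaped like the extraction quoted above (an assumption).
page_text = "Chapter\n1\nIn\ntro\nduction\n"


def find_pattern(pattern: str, text: str) -> list[str]:
    """Return every extracted fragment (one per line) that matches the pattern."""
    regex = re.compile(pattern)
    return [line for line in text.splitlines() if regex.search(line)]


# The complete word never appears inside a single fragment, so nothing matches.
whole_word = find_pattern("Introduction", page_text)

# A piece that lies between two newline characters does match.
fragment = find_pattern("duction", page_text)

print(whole_word)  # []
print(fragment)    # ['duction']
```

In this sketch the pattern itself is fine; the match fails only because the word is spread over several fragments, which is why looking at the raw extracted text helps when choosing a pattern.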

Thanks for the beautiful package!

PS: I have integrated the functionality of pdf.tocgen into Emacs's toc-mode. You might like toc-mode's other features too. I am not sure if you are using Emacs, but if you are using vim you might like to try out Spacemacs (check out the develop branch immediately after downloading).

Answer 1 · 2020-09-09T13:35:49.000Z

I'm glad that you find this tool helpful!

I actually mentioned in the documentation that if you leave out the search pattern

$ pdfxmeta doc.pdf

pdfxmeta will dump the entire document/page. But the rationale is that if you can't find the pattern you are searching for, it is very likely that the structure of the pdf is messed up and pdftocgen can't generate a meaningful outline for the document.

The document you are pointing to does not seem to have been produced by pdflatex directly; it was probably converted from PostScript, since the glyph outlines are rasterized.

Answer 2 · 2020-09-09T15:15:59.000Z

Just after I created this issue, I realized that I did not read the full documentation for pdf.tocgen, I just skimmed it quickly so that I could test it. I still prefer to show the page text automatically when defining the level with pdfxmeta fails, which is what I am doing in toc-mode. Well, this is subjective of course.
Still I also created the issue because I wanted to notify you about the integration of pdf-tocgen in toc-mode, and because the toc-mode is really quite elegant and powerful and a very nice extension to pdf-tocgen.

Answer 3 · 2020-09-09T15:19:13.000Z

I realize now that you may not have seen the note about toc-mode, because I added it in a later edit of the issue, which you probably missed if you only checked your e-mail. Well, if you are interested, you can read my original issue/report here on GitHub.

Answer 4 · 2020-09-14T04:34:10.000Z

> I still prefer to show the page text automatically when defining the level with pdfxmeta fails.

The purpose of pdfxmeta is to search for a pattern in a pdf file and print out the metadata of matching strings. If a pattern is not found, the most natural response would be to say that there is no such pattern in the file and hence to print nothing. This is consistent with the behavior of grep. I think it would be very strange for grep to print out the entire input when nothing is found.
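The grep-like contract described here can be illustrated with a short Python sketch. It is only an analogy under stated assumptions: `grep_lines` and the sample lines are hypothetical, and this is not how pdfxmeta is implemented.

```python
import re


def grep_lines(pattern: str, lines: list[str]) -> list[str]:
    """Return only the lines that match; an empty list means 'no such pattern'."""
    return [line for line in lines if re.search(pattern, line)]


# Hypothetical input standing in for text extracted from a document.
lines = ["Chapter 1", "Introduction", "Some body text"]

hits = grep_lines("Introduction", lines)   # one matching line
misses = grep_lines("Conclusion", lines)   # no match, so nothing is returned

for line in hits:
    print(line)
for line in misses:
    print(line)  # never runs: a failed search prints nothing
```

The caller can tell a failed search from a successful one by the empty result alone, so dumping the input on failure would mix two kinds of output in one stream.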

I was expecting pdfxmeta to be executed multiple times to find the metadata you need. The default output is meant for reading, so it is not valid TOML. The auto mode should be used when you are certain that the output is exactly what you are looking for, though that is usually not the case, since there can be multiple matches or none. It makes sense to provide a hint to the user in a GUI application, but overloading the output would be confusing for a single-purpose command-line application.

> I wanted to notify you about the integration of pdf-tocgen in toc-mode.

Nice! I have added a link to toc-mode in the readme. You should probably mention that pdf-view-mode is required to use the toc-gen-set-level command. The default doc-view-mode renders the pdf as images, so I can't select anything.

I was also getting a

invalid-function pdf-view-current-page

error when using toc-gen-set-level similar to politza/pdf-tools#210. It can be fixed with the same modification, but I can't reproduce it any more.

I would also suggest you make a minimal walk-through using plain Emacs similar to the On Lisp example of pdf.tocgen, showing all the steps you need from opening the pdf file to importing toc back to pdf. It really helps piece everything together.

Answer 5 · 2020-09-26T21:06:54.000Z

Hi, thanks for the nice answer and sorry for the very late reply. I have finally read the full user manual of pdf-tocgen, and I will add some extra options and documentation to toc-mode. Also, thanks for linking to toc-mode from your README.

The information provided with the link is a little misleading, because toc-mode mainly uses the epdfinfo server and the poppler and djvused command-line tools. Toc-mode already existed before pdf.tocgen was published, and its main functionality was/is to extract a TOC from the printed TOC inside a pdf or djvu document (optionally via OCR) and to help clean up the contents, correctly set the page numbers, and attach the result to the document. I then extended toc-mode with the functionality of this package when I found it, but pdf.tocgen does not need to be installed to use most of toc-mode's functionality.

Finally, thank you for the advice about the documentation and for notifying me about the pdf-view-current-page issue and its solution. When I find some time, I will indeed extend the documentation so that it more closely resembles the excellent documentation for pdf.tocgen (although I do expect some inventiveness from Emacs/Spacemacs users also ;)

Answer 6 · 2020-09-27T02:16:20.000Z

Anyway, since there are no more questions, I'll close this issue now.