metachris/pdfx

URLs truncated at line endings

bitsgalore opened this issue · 4 comments

First of all: great tool! I did however come across a problem with URLs that span more than one line. I've attached a PDF that reproduces the problem here:

testpdfx.pdf

Command:

pdfx -v testpdfx.pdf -o testpdfx.txt

The URL in the footnote is extracted as:

http://jpylyzer.openpreservation.org//2016/01/06/Release-of-

Whereas this should be:

http://jpylyzer.openpreservation.org//2016/01/06/Release-of-jpylyzer-1-17-0

I used pdfx version 1.3.1 on Linux Mint.
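
For illustration, here is a minimal sketch of how the extracted text could be post-processed to rejoin a URL that was cut at a line ending. It is not part of pdfx: the helper name `join_wrapped_urls` and the heuristic (a line that ends in a URL whose last character is `-` or `/` is assumed to continue on the next line) are hypothetical. It also assumes the text extraction keeps the line break directly after the truncated URL.

```python
import re

# Hypothetical heuristic, not pdfx's own logic: a line whose last token is a
# URL ending in "-" or "/" is treated as continuing on the following line.
TRAILING_URL = re.compile(r"https?://\S*[-/]$")


def join_wrapped_urls(text: str) -> str:
    """Rejoin URLs that were split across line endings in extracted text."""
    joined = []
    for line in text.splitlines():
        if joined and line.strip() and TRAILING_URL.search(joined[-1]):
            # The previous line ends in an unfinished URL: append this line
            # to it without inserting a space or a newline.
            joined[-1] += line.strip()
        else:
            joined.append(line.rstrip())
    return "\n".join(joined)


if __name__ == "__main__":
    sample = (
        "http://jpylyzer.openpreservation.org//2016/01/06/Release-of-\n"
        "jpylyzer-1-17-0\n"
    )
    print(join_wrapped_urls(sample))
```

This heuristic can over-join, for example when a URL that legitimately ends in `/` is the last thing on a line, and it misses URLs that break after any other character, so it is only a workaround sketch rather than a fix for the extraction itself.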

Hi, I'm not sure if you are still working on this code, but on the chance that you are, I wanted to let you know that I experience the same issue in pdfx v1.3.1 that bitsgalore reported above.

I would love to see a solution to this issue. It is one of two problems that are stopping me from using pdfx for my academic research.

I see the same issue; URLs reported as good or as 404 are truncated at 20 characters when using the command format:
pdfx testpdfx.pdf -c

Same issue here. Lots of URLs are ignored or treated as invalid because they span multiple lines in a PDF (especially when the lines are narrow). Please fix - this is a critical issue preventing me from using pdfx!