christian-vigh-phpclasses/PdfToText

gzuncompress(): data error

phisu opened this issue · 8 comments

phisu commented

i have some pdf files which throw the following error when i try to extract the text:
$pdf = new PdfToText ($filename) ;
echo $pdf->Text;
output: gzuncompress(): data error PdfToText.phpclass 1487

what can i do to prevent this error?
you can find the pdf file here:
http://www2.ivm-rheinmain.de/wp-content/uploads/2012/02/Leitfaden_Maerz07.pdf

Hi,

Thanks for submitting this issue, which shows yet another way to encode
images in a pdf file.

Don't worry, your code is perfectly correct!

To tell the truth, for the moment I'll have to ask you to be a little patient!

In fact, when I implemented image extraction, I decided to throw an
exception when encountering unhandled ways of encoding images. My idea was
simply to detect the various encodings that are not clearly
described in the pdf specifications. Although I'm currently handling only
jpeg images, I will have a look at your sample pdf file, because it presents
yet another way of encoding image data, and maybe it will help to understand
how

I will come back to you soon, once I have figured out what happens.


phisu commented

thank you for your quick answer.

what do you think about introducing a switch that lets us skip the image extraction? is it possible to skip the image extraction at all?
or maybe your class could just return a note that some images were not extracted, instead of throwing an exception.

these thoughts are maybe too simple, and honestly i have no understanding of the structure of pdf.

Yes, this is in fact what I did in the latest version (1.2.19, but check
the header comments rather than the git tags, because I made a mistake
in the versioning info and I think they are out of sync). With this version,
image data will only be extracted from the pdf file if the
PDFOPT_GET_IMAGE_DATA flag is specified in the $options parameter of the
constructor (or in the Options property, before calling the Load method), plus
PDFOPT_DECODE_IMAGE_DATA if you want to convert the images to jpeg resources at
the same time.

Images are no longer extracted by default, so it should run better in
some cases (you can download the latest version).
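
For readers who do want image data, here is a minimal sketch of how the two flags might be passed. It assumes the flags are exposed as class constants (PdfToText::PDFOPT_...), which this thread does not confirm, and that the class file has already been loaded; extractPdf() and 'sample.pdf' are hypothetical names used only for illustration.

```php
<?php
// Sketch only: assumes PdfToText is already loaded and that the PDFOPT_*
// flags are class constants (an assumption, check the class header comments).
function extractPdf(string $filename, bool $withImages = false): string
{
    if ($withImages) {
        // Ask for raw image data AND conversion to image resources.
        $options = PdfToText::PDFOPT_GET_IMAGE_DATA
                 | PdfToText::PDFOPT_DECODE_IMAGE_DATA;
        $pdf = new PdfToText($filename, $options);
    } else {
        // Default behavior described above: text only, no image extraction.
        $pdf = new PdfToText($filename);
    }

    return $pdf->Text;
}

// Only call it when the class is actually available.
if (class_exists('PdfToText')) {
    echo extractPdf('sample.pdf');   // hypothetical file name
}
```

Keeping the flag choice in one small function makes it easy to switch image handling on for a single file while leaving the text-only default everywhere else.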

However, even with this default behavior, it seems that my class has a
problem with the sample pdf file you sent me, so I have to investigate
the origin of this problem a little.

I agree that throwing an exception when encountering bad image data is
definitely a really bad solution; but this is only a temporary measure: as
there are multiple ways to encode images in pdf files (most of them
unknown to me), I made the bet that relying on users reporting the
exceptions they receive would be a good way for me to get an overview of most
of the possible cases.

As an example, another user got the same exception as you because his pdf
file contained images in an Adobe proprietary format; he reported the
problem to me and told me, like you, that he was not interested in extracting
images, and this is why I changed the default behavior of my class. But at
least throwing an exception in such a case helped me identify a new image
format that I did not handle. Of course, I still do not handle it, but it has
been identified in my code, and my class does nothing when it encounters
such a case.

But... wait... I can enable exception throwing only if debug mode is enabled,
and silently ignore unrecognized image formats when not in debug mode!
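
The idea of throwing only in debug mode can be sketched as follows. This is a hypothetical stand-in, not the PdfToText code: the class ImageDecoderSketch, its decode() method and the UnsupportedImageFormat exception are invented for illustration; only the notion of a static $DEBUG switch comes from this thread.

```php
<?php
// Hypothetical stand-in, NOT the real PdfToText class.
class UnsupportedImageFormat extends Exception {}

class ImageDecoderSketch
{
    // Global debug switch, off by default.
    public static $DEBUG = false;

    // Returns the decoded data for known formats, or null for unknown ones.
    public function decode(string $format, string $data): ?string
    {
        if ($format === 'jpeg') {
            return $data;   // pretend-decode: pass the bytes through
        }

        if (self::$DEBUG) {
            // Debug mode: surface the unknown format loudly.
            throw new UnsupportedImageFormat("Unhandled image format: $format");
        }

        return null;        // normal mode: silently skip the image
    }
}
```

With the switch off, callers just see a missing image; with it on, the exception still reports formats that are not handled yet.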

Ok, so as a temporary conclusion to our current exchange:

  • On your side, download the latest version of my class and try
    again. Let me know the outcome of your testing.

  • On my side, I will do the following:

    o Change my class so that no exception is thrown upon
      unrecognized image formats if the PdfToText::$DEBUG global variable is not
      set to true

    o Investigate the problem with the first sample you sent me
      (Leitfaden_Maerz07.pdf), because I suspect it is not clearly related to what
      I explained above. But more on this later...

In any case, I will come back to you when a new version is available (I
hope this to be ready by tomorrow evening).

Christian.


phisu commented

thank you a lot for your very quick answer!

i downloaded the latest version of your class ( [Version : 1.2.19] [Date : 2016/07/19] ) and made a test with the same pdf.


$filename = 'Leitfaden_Maerz07.pdf';
$pdf = new PdfToText ($filename) ;
echo $pdf->Text;

in the browser i got no output, but in the apache log i found the following error:

PHP Fatal error: Uncaught exception 'PdfToTextException' with message 'Pdf decoding error (object #425) : Invalid gzip data.' in PdfToText.php:1490\nStack trace:\n#0 PdfToText.php(1078): PdfToText->DecodeData(425, '\\x08\\xC0\\xC5\\xDFe\\x1C~\\xBC\\x84\\x1A\\x7F\\xB5+\xA1...', 3)\n#1 PdfToText.php(935): PdfToText->Load('Leitfaden_Maerz07.pdf')\n#2 test.php(26): PdfToText->__construct('...')\n#3 {main}\n thrown in PdfToText.php on line 1490

here is another pdf file which produces the same error. maybe this helps to find what is going wrong.
http://wiki.iao.fraunhofer.de/images/studien/green-office.pdf

philipp.

Hi Philipp,

Thank you for this additional work and for sending me a second sample (I
will have a better chance of identifying the issue that way).

With this latest version of my class you got a different error message,
but that is simply because I slightly changed it (as well as the way such
errors are handled).

This is an interesting case; it is clearly not linked to image extraction,
since that is disabled by default in my latest version. I suspect that some
part of the pdf file has been mistakenly recognized as containing something
like character maps or drawing instructions in compressed format, when
it does not contain gzipped data at all.
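
One way to test that suspicion without touching the class is to check the two-byte zlib header before calling gzuncompress(). The sketch below follows the header rules of RFC 1950; the function names looksLikeZlib() and tryInflate() are hypothetical helpers, not part of PdfToText.

```php
<?php
// Checks the two-byte zlib header (RFC 1950) without inflating anything.
function looksLikeZlib(string $data): bool
{
    if (strlen($data) < 2) {
        return false;
    }

    $cmf = ord($data[0]);   // compression method and window size
    $flg = ord($data[1]);   // flags, including the header check bits

    // Method must be 8 (deflate) and the 16-bit header a multiple of 31.
    return ($cmf & 0x0F) === 8 && (($cmf << 8) | $flg) % 31 === 0;
}

// Returns the inflated string, or null when the data is not valid zlib.
function tryInflate(string $data): ?string
{
    if (!looksLikeZlib($data)) {
        return null;
    }

    $result = @gzuncompress($data);   // returns false on a data error

    return $result === false ? null : $result;
}

// First bytes of object #425 as shown in the stack trace above.
var_dump(looksLikeZlib("\x08\xC0\xC5\xDF"));          // bool(false)
var_dump(tryInflate(gzcompress('BT (Hello) Tj ET'))); // string(16) "BT (Hello) Tj ET"
```

The bytes from the stack trace fail the header check, which is consistent with the stream not being plain zlib data at the point where it is inflated.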

Thank you for the testing, since it gave me additional information.

I'll come back to you when this issue is solved.

Christian.


Ok, I've made a little progress on this issue. The files were generated
with Adobe Acrobat Distiller, and any piece that should normally be encoded
in gzip format (which can be uncompressed by the standard gzuncompress() PHP
function) seems to be encoded in a different format, which seems
Adobe-specific.

I changed my class so that it does not throw an exception when such an
encoding is encountered and the PdfToText::$DEBUG global variable is set to
false (which is the default value).

However, my class is unable to extract anything from your samples: even the
text-drawing instructions are compressed in this specific format, so the
Text property is empty.

I had already found such a situation in one or two samples, but it did not
concern text-drawing instructions.

So what I have to do now is find some reliable documentation about what
seems to me to be a strange compression format, then implement it... More on
this later!

Christian.


Hi again Philipp,

Ok, I found out what happens. Here is what I tried:

  • Save the file using Acrobat Reader: nothing changed, no text
    extraction happens

  • Print the file using PdfCreator: it simply failed, displaying an
    error message saying that there was a conversion error!

  • It also failed with PrimoPdf

  • However, I was more successful with PdfPro 10: simply print
    your file, run the PdfToText class on the result, and you will see your
    text.

Of course, this is not an acceptable solution: it just helped me understand
what happens. The PdfPro 10 software simply removed the encryption before
generating the output file.

In fact I already knew that pdf files can be password-protected (and
handling password-protected pdf files is on my to-do list). However, all the
data in your samples has been encrypted, although no password is required to
read them with Acrobat, and this is why there is an "invalid gzip
data" error: the gzipped data needs to be decrypted before
being uncompressed. This is yet another new case I have to handle.
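
The order of operations can be illustrated with a small self-contained sketch. RC4 is used here only as a stand-in cipher, since this thread does not say which cipher the sample files use, and the key is a hypothetical placeholder: deriving the real per-object key from the pdf encryption dictionary is not shown.

```php
<?php
// Plain RC4, used as a stand-in cipher for this illustration.
// Encryption and decryption are the same operation.
function rc4(string $key, string $data): string
{
    $s = range(0, 255);
    $j = 0;
    $keyLen = strlen($key);

    // Key scheduling
    for ($i = 0; $i < 256; $i++) {
        $j = ($j + $s[$i] + ord($key[$i % $keyLen])) & 0xFF;
        [$s[$i], $s[$j]] = [$s[$j], $s[$i]];
    }

    // Keystream generation, XORed with the data
    $i = $j = 0;
    $out = '';
    for ($n = 0, $len = strlen($data); $n < $len; $n++) {
        $i = ($i + 1) & 0xFF;
        $j = ($j + $s[$i]) & 0xFF;
        [$s[$i], $s[$j]] = [$s[$j], $s[$i]];
        $out .= chr(ord($data[$n]) ^ $s[($s[$i] + $s[$j]) & 0xFF]);
    }

    return $out;
}

$key    = 'hypothetical-object-key';        // placeholder, not a real pdf key
$plain  = 'BT /F1 12 Tf (Hello) Tj ET';     // sample text-drawing instructions
$stored = rc4($key, gzcompress($plain));    // compressed first, then encrypted

// Inflating the stored bytes directly fails, as in the reported error.
$direct = @gzuncompress($stored);

// Decrypting first and inflating afterwards recovers the content.
$recovered = gzuncompress(rc4($key, $stored));

var_dump($direct);      // bool(false)
var_dump($recovered);   // string(26) "BT /F1 12 Tf (Hello) Tj ET"
```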

Ok, it will take me a few days to solve this issue, but it is a really
interesting case that will help me go a step further towards handling
password-protected files (note that I do not intend to provide a
password-cracking solution!).

Christian.



phisu commented

hi christian,

thank you for your efforts. i hope you will find a solution, because i
tried several classes to extract the text from pdf, and it seems to me
that your class is the best. let me know if i can help you.

philipp
