ledongthuc/pdf

Won't open some PDFs

marco-zanon opened this issue · 1 comment

While the package opens most PDF files normally, it fails on some of them and panics with the error "panic: malformed PDF: reading at offset 0: stream not present".

For example, the file "SP 10-2019 Relatório Analítico de Composições de Custos.pdf" (which you can get from the URL "https://www.gov.br/dnit/pt-br/assuntos/planejamento-e-pesquisa/custos-e-pagamentos/custos-e-pagamentos-dnit/sistemas-de-custos/sicro/sudeste/espirito-santo/2019/outubro-1/es-outubro-2019.zip", after extracting the zip file) won't open with your "github.com/ledongthuc/pdf" package, but opens normally in any PDF reader (Adobe Reader, for instance).

FWIW, the entire error message that I get while trying to open the file is:

panic: malformed PDF: reading at offset 0: stream not present

goroutine 1 [running]:
github.com/ledongthuc/pdf.(*buffer).errorf(...)
C:/Users/mazrodrigues/go/pkg/mod/github.com/ledongthuc/pdf@v0.0.0-20200323191019-23c5852adbd2/lex.go:82
github.com/ledongthuc/pdf.(*buffer).reload(0xc04c7db790, 0x0)
C:/Users/mazrodrigues/go/pkg/mod/github.com/ledongthuc/pdf@v0.0.0-20200323191019-23c5852adbd2/lex.go:95 +0x1fe
github.com/ledongthuc/pdf.(*buffer).readByte(0xc04c7db790, 0xc0003ff9d0)
C:/Users/mazrodrigues/go/pkg/mod/github.com/ledongthuc/pdf@v0.0.0-20200323191019-23c5852adbd2/lex.go:71 +0x67
github.com/ledongthuc/pdf.(*buffer).readToken(0xc04c7db790, 0xc0732d6260, 0x1000)
C:/Users/mazrodrigues/go/pkg/mod/github.com/ledongthuc/pdf@v0.0.0-20200323191019-23c5852adbd2/lex.go:135 +0x47
github.com/ledongthuc/pdf.Interpret(0x0, 0x0, 0x0, 0x0, 0xc04c7db930)
C:/Users/mazrodrigues/go/pkg/mod/github.com/ledongthuc/pdf@v0.0.0-20200323191019-23c5852adbd2/ps.go:64 +0x1ae
github.com/ledongthuc/pdf.Page.Content(0xc04f7395c0, 0x48, 0x4dad60, 0xc073356000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
C:/Users/mazrodrigues/go/pkg/mod/github.com/ledongthuc/pdf@v0.0.0-20200323191019-23c5852adbd2/page.go:816 +0x2db
main.extraiPDFAnalitico(0x539921, 0x49, 0x0)
C:/Users/mazrodrigues/Documents/04_Golang/04_Faro/02_Tabela_Referencia/01_SICRO/04_trata_revisionais/v02/pdf_analitico.go:50 +0x165
main.main()
C:/Users/mazrodrigues/Documents/04_Golang/04_Faro/02_Tabela_Referencia/01_SICRO/04_trata_revisionais/v02/main.go:18 +0xa8
exit status 2
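
Since the failure surfaces as a panic rather than a returned error, a caller processing many files may want to stop one bad PDF from crashing the whole program. Below is a minimal sketch of that idea using only the standard library. The helper name `safely` is hypothetical and not part of the package, and the panic in `main` is simulated; a real caller would wrap its own call into the library (for example the page-content extraction seen in the trace). This does not fix the underlying parsing problem, it only turns the panic into an ordinary error.

```go
package main

import (
	"errors"
	"fmt"
)

// safely runs fn and converts any panic it raises into an ordinary error.
// The name is hypothetical; fn stands in for whatever call panics, e.g. a
// closure around the page-content extraction shown in the trace above.
func safely(fn func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered from panic: %v", r)
		}
	}()
	return fn()
}

func main() {
	// Simulated failure: a real caller would invoke the PDF library here.
	err := safely(func() error {
		panic(errors.New("malformed PDF: reading at offset 0: stream not present"))
	})
	fmt.Println(err)
}
```

With a wrapper like this, the caller can log the error, skip the file, and continue with the next one.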

A workaround is to clean the PDF with mupdf before using this package:

$ sudo apt-get install mupdf-tools

$ mutool clean -s dirty.pdf clean.pdf

clean.pdf should now work with this package.
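
If the cleaning step needs to run from inside a Go program, one option is to shell out to the same command. The following is a rough sketch, assuming `mutool` is installed and on the PATH. The function names `cleanArgs` and `cleanPDF` are made up for illustration, and the only mutool invocation used is the `clean -s` form shown above.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// cleanArgs returns the mutool arguments used in the workaround above:
// "clean -s <in> <out>". (Hypothetical helper, for illustration only.)
func cleanArgs(in, out string) []string {
	return []string{"clean", "-s", in, out}
}

// cleanPDF runs "mutool clean -s in out" and, on failure, returns an error
// that includes whatever mutool printed.
func cleanPDF(in, out string) error {
	cmd := exec.Command("mutool", cleanArgs(in, out)...)
	if b, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("mutool clean %q: %w: %s", in, err, b)
	}
	return nil
}

func main() {
	if len(os.Args) != 3 {
		fmt.Fprintln(os.Stderr, "usage: clean <dirty.pdf> <clean.pdf>")
		return
	}
	if err := cleanPDF(os.Args[1], os.Args[2]); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```

The cleaned output file can then be passed to the package in place of the original.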

Credit @YspCoder

Another option is to use a Go wrapper around MuPDF to extract text from PDFs: https://github.com/gen2brain/go-fitz