extract text not working
Closed this issue · 3 comments
tcurdt commented
I tried this
func extractText(fname string) error {
	fd, err := os.Open(fname)
	if err != nil {
		return err
	}
	defer fd.Close()

	r, err := pdf.NewReader(fd, nil)
	if err != nil {
		return err
	}

	contents := reader.New(r, nil)
	contents.Text = func(text string) error {
		fmt.Print(text)
		return nil
	}

	pages := pagetree.NewIterator(r)
	pageNo := 0
	pages.All()(func(_ pdf.Reference, pageDict pdf.Dict) bool {
		fmt.Println("Page", pageNo)
		err := contents.ParsePage(pageDict, matrix.Identity)
		if err != nil {
			log.Fatal(err)
		}
		pageNo++
		return true
	})
	return nil
}
and was hoping this would extract text from machine-generated PDFs or PDFs with OCR information added, but it prints nothing for any of the PDFs I tried. What am I missing?
seehuhn commented
This is a work in progress, so you have probably just hit something I have not implemented yet. If you attach a PDF file that shows the problem, I can have a look.
tcurdt commented
I need to see if I can find a PDF that is acceptable to share. (I am sceptical there is one.)
FWIW I switched to https://github.com/ledongthuc/pdf which works pretty well. HTH.
seehuhn commented
I'm closing this for now. If you have a PDF file to share, please reopen.