seehuhn/go-pdf

extract text not working

Closed this issue · 3 comments

I tried this

func extractText(fname string) error {
    fd, err := os.Open(fname)
    if err != nil {
        return err
    }
    defer fd.Close()

    r, err := pdf.NewReader(fd, nil)
    if err != nil {
        return err
    }

    contents := reader.New(r, nil)
    contents.Text = func(text string) error {
        fmt.Print(text)
        return nil
    }

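    pages := pagetree.NewIterator(r)
    pageNo := 0
    var parseErr error
    pages.All()(func(_ pdf.Reference, pageDict pdf.Dict) bool {
        fmt.Println("Page", pageNo)

        // Stop iterating and hand the error back to the caller
        // instead of exiting the whole program.
        parseErr = contents.ParsePage(pageDict, matrix.Identity)
        if parseErr != nil {
            return false
        }

        pageNo++
        return true
    })
    return parseErr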
}

and was hoping this would extract text from machine-generated PDFs or PDFs with added OCR information, but it prints nothing for any of the PDFs I tried. What am I missing?
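One way to narrow down "prints nothing" might be to check whether the text callback is ever invoked, or whether it is invoked with empty strings. The sketch below is only an illustration under assumptions: `textCounter` and `wrap` are hypothetical helpers, not part of go-pdf, and the only thing taken from the snippet above is the callback shape `func(string) error`. The loop in `main` is a stand-in for the library calling the callback; it says nothing about what go-pdf actually does.

```go
package main

import (
	"fmt"
	"os"
)

// textCounter is a hypothetical helper (not part of go-pdf). It records how
// often a text callback fired and how many bytes it received in total.
type textCounter struct {
	calls int
	bytes int
}

// wrap returns a callback with the same shape as the Text callback in the
// snippet above. It updates the counters and then forwards to next.
func (c *textCounter) wrap(next func(string) error) func(string) error {
	return func(text string) error {
		c.calls++
		c.bytes += len(text)
		return next(text)
	}
}

func main() {
	c := &textCounter{}
	cb := c.wrap(func(text string) error {
		fmt.Print(text)
		return nil
	})

	// Stand-in for the library invoking the callback while parsing a page.
	for _, s := range []string{"Hello, ", "world\n"} {
		if err := cb(s); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
	}

	// Report on stderr so the summary does not mix with the extracted text.
	fmt.Fprintf(os.Stderr, "callback calls: %d, bytes: %d\n", c.calls, c.bytes)
}
```

In the program above, the wrapped callback would presumably be assigned with something like `contents.Text = c.wrap(...)`. Zero calls would suggest the callback never fires for these files, while a non-zero call count with zero bytes would point at empty strings being delivered.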

This is a work in progress, so you have probably just hit something I have not implemented yet. If you attach a PDF file that shows the problem, I can have a look.

I need to see if I can find a PDF that is acceptable to share. (I am sceptical there is one.)

FWIW I switched to https://github.com/ledongthuc/pdf which works pretty well. HTH.

I'm closing this for now. If you have a PDF file to share, please reopen.