documentcloud/docsplit

Add option to generate hOCR output from tesseract

Closed this issue · 5 comments

It'd be nice to have the ability to generate hOCR output when running OCR via tesseract. I have a patch and will send the pull request.

This is a really good idea. +1

@jsfenfen Yep, been talking with @lukerosiak about this. Definitely want to get it into the lib.

It is indeed as simple as adding the hocr flag to the tesseract call; no config file appears to be required. But then you have to turn the hOCR (HTML) into text, since I don't think you can get tesseract to produce both. To turn hOCR into text, something like this (Python):

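```python
from bs4 import BeautifulSoup  # requires the beautifulsoup4 package

# fin: an open file (or a string) holding tesseract's hOCR output
soup = BeautifulSoup(fin, 'html.parser')

for graf in soup.find_all('p', {'class': 'ocr_par'}):
    print('')  # blank line before each paragraph
    for line in graf.find_all('span', {'class': 'ocr_line'}):
        # join every text node inside the line, dropping the markup
        print(line.get_text())
```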
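For reference, the tesseract call with the hocr flag might look like the sketch below. It assumes the usual `tesseract <image> <output base> hocr` form; the helper names `hocr_command` and `run_hocr` and the file names are hypothetical. The output extension seems to depend on the tesseract version (`.html` in older releases, `.hocr` in newer ones), so check what your build writes.

```python
import subprocess


def hocr_command(image_path, output_base):
    """Build the argument list for a tesseract run that emits hOCR.

    The trailing "hocr" selects hOCR output; no custom config file
    is written here.
    """
    return ["tesseract", image_path, output_base, "hocr"]


def run_hocr(image_path, output_base):
    """Run tesseract (must be on PATH); raises CalledProcessError on failure."""
    subprocess.run(hocr_command(image_path, output_base), check=True)


if __name__ == "__main__":
    # Hypothetical file names, shown only to illustrate the argument order.
    print(" ".join(hocr_command("page_1.tif", "page_1")))
```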

On Tue, Jun 11, 2013 at 12:20 PM, Ted Han wrote:

@jsfenfen Yep, been talking with @lukerosiak about this. Definitely want to get it into the lib.



I'm doing something similar in Ruby, as I want to have both outputs. If you'd like, I can add something like this to the patch. The only downside is that it doesn't currently do any text cleaning, though I'm sure that could be added.

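require 'nokogiri'

# Writes "#{page}.txt" from the hOCR in "#{page}.html" and annotates each
# word element with its character offsets into that text file.
def emit_text(page)
  # Block form so the input file is closed once parsing is done
  doc = File.open("#{page}.html") { |f| Nokogiri::HTML(f) }
  File.open("#{page}.txt", "w") do |out|
    pos = 0 # running character offset into the text output
    doc.css('.ocr_par').each do |par|
      par.css('.ocr_line').each do |line|
        line.css('.ocrx_word').each do |word|
          out.write("#{word.text} ")
          word['start'] = pos
          word['stop'] = pos + word.text.length
          pos += word.text.length + 1 # the word plus its trailing space
        end
        out.write("\n") # end of line
        pos += 1
      end
      out.write("\n") # blank line between paragraphs
      pos += 1
    end
  end
  # File.write closes the file, so the annotated HTML is flushed to disk
  File.write("#{page}.html", doc.to_html)
end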
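To check that offsets computed this way really index into the emitted text, here is a rough Python sketch of the same bookkeeping using only the standard library. It assumes the hOCR nests `ocrx_word` inside `ocr_line` inside `ocr_par`, as in the snippets above; the names `HocrWords` and `text_with_offsets` are made up for illustration, and the offsets are character offsets, not bytes.

```python
from html.parser import HTMLParser


class HocrWords(HTMLParser):
    """Collect words as paragraphs -> lines -> words from hOCR markup."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._depth = 0    # nesting depth inside the current word; 0 = outside
        self._chunks = []  # text pieces of the word being read

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1  # e.g. <strong> nested inside a word
        elif "ocr_par" in classes:
            self.paragraphs.append([])
        elif "ocr_line" in classes and self.paragraphs:
            self.paragraphs[-1].append([])
        elif "ocrx_word" in classes and self.paragraphs and self.paragraphs[-1]:
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:  # the word element just closed
                self.paragraphs[-1][-1].append("".join(self._chunks))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def text_with_offsets(hocr):
    """Return (text, [(word, start, stop), ...]) with text[start:stop] == word."""
    parser = HocrWords()
    parser.feed(hocr)
    parser.close()
    out, offsets, pos = [], [], 0
    for par in parser.paragraphs:
        for line in par:
            for word in line:
                out.append(word + " ")
                offsets.append((word, pos, pos + len(word)))
                pos += len(word) + 1  # the word plus its trailing space
            out.append("\n")  # end of line
            pos += 1
        out.append("\n")  # blank line between paragraphs
        pos += 1
    return "".join(out), offsets
```

Slicing the returned text with each `(start, stop)` pair should give back the word, which is the property the `start`/`stop` attributes rely on.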

Closing in favor of #92