Add option to generate hOCR output from tesseract
Closed this issue · 5 comments
It'd be nice to have the ability to generate hOCR output when running OCR via tesseract. I have a patch and will send the pull request.
This is a really good idea. +1
@jsfenfen Yep, been talking with @lukerosiak about this. Definitely want to get it into the lib.
It is indeed as simple as adding the `hocr` flag to the tesseract call; no config file appears to be required. But then you have to turn the hOCR (HTML) into text, since I don't think you can get tesseract to produce both. To turn hOCR into text, something like this (Python):
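```python
# Updated to Python 3 and BeautifulSoup 4; the original used the BeautifulSoup 3 import.
from bs4 import BeautifulSoup

# "page.html" is a placeholder for the hOCR file written by tesseract.
with open("page.html") as fin:
    soup = BeautifulSoup(fin, "html.parser")

for graf in soup.find_all("p", {"class": "ocr_par"}):
    print()  # blank line between paragraphs
    for line in graf.find_all("span", {"class": "ocr_line"}):
        print("".join(line.find_all(string=True)))
```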
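The tesseract side of this, adding `hocr` to the call, might look roughly like the sketch below. It assumes the `tesseract` binary is on the PATH. `build_hocr_command` and `run_hocr` are hypothetical helper names for illustration, not part of any library. The file tesseract writes is named after the output base, and its extension (`.html` or `.hocr`) depends on the tesseract version.

```python
import subprocess


def build_hocr_command(image_path, output_base, hocr=True):
    """Return the argv list for a tesseract call (hypothetical helper).

    The CLI form is: tesseract <image> <outputbase> [configs...].
    Appending the stock "hocr" config switches the output from
    plain text to hOCR.
    """
    cmd = ["tesseract", image_path, output_base]
    if hocr:
        cmd.append("hocr")
    return cmd


def run_hocr(image_path, output_base):
    """Run tesseract; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_hocr_command(image_path, output_base), check=True)
```

Keeping the command construction separate from the `subprocess.run` call makes the flag handling easy to check without tesseract installed.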
On Tue, Jun 11, 2013 at 12:20 PM, Ted Han wrote:
> @jsfenfen Yep, been talking with @lukerosiak about this. Definitely want to get it into the lib.
I'm doing something similar in Ruby, as I want to have both outputs. If you'd like, I can add something like this to the patch. The only downside is that it doesn't do any text cleaning (currently), though I'm sure that could be added.
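```ruby
require 'nokogiri'

# Write "#{page}.txt" from the hOCR in "#{page}.html", and annotate each word
# in the hOCR with its character offsets (start/stop) in the text file.
def emit_text(page)
  doc = File.open("#{page}.html") { |f| Nokogiri::HTML(f) }
  File.open("#{page}.txt", "w") do |out|
    pos = 0 # running character offset into the text output
    doc.css('.ocr_par').each do |par|
      par.css('.ocr_line').each do |line|
        line.css('.ocrx_word').each do |word|
          out.write("#{word.text} ")
          word['start'] = pos.to_s
          word['stop'] = (pos + word.text.length).to_s
          pos += word.text.length + 1 # the word plus its trailing space
        end
        out.write("\n") # end of line
        pos += 1
      end
      out.write("\n") # blank line between paragraphs
      pos += 1
    end
  end
  # File.write closes the file; the bare File.open(...).write did not.
  File.write("#{page}.html", doc.to_html)
end
```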
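For comparison with the earlier Python snippet, the same offset bookkeeping can be sketched in Python using only the standard library's `html.parser`. This is a rough equivalent, not a port of the patch: `HocrText` and `hocr_to_text` are hypothetical names, and it assumes well-formed hOCR in which every non-void element is closed. It returns the offsets instead of writing them back into the hOCR.

```python
from html.parser import HTMLParser

VOID = {"meta", "link", "br", "hr", "img", "input"}  # tags with no end tag
HOCR = ("ocr_par", "ocr_line", "ocrx_word")


class HocrText(HTMLParser):
    """Collect plain text and per-word (word, start, stop) offsets."""

    def __init__(self):
        super().__init__()
        self.stack = []   # hOCR class (or None) of each open element
        self.buf = None   # text pieces of the current word, None outside a word
        self.parts = []   # pieces of the plain-text output
        self.pos = 0      # running character offset into the output
        self.words = []

    def _emit(self, text):
        self.parts.append(text)
        self.pos += len(text)

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        cls = next((c for c in classes if c in HOCR), None)
        self.stack.append(cls)
        if cls == "ocrx_word":
            self.buf = []

    def handle_data(self, data):
        if self.buf is not None:  # only keep text inside a word
            self.buf.append(data)

    def handle_endtag(self, tag):
        if tag in VOID or not self.stack:
            return
        cls = self.stack.pop()
        if cls == "ocrx_word":
            word = "".join(self.buf)
            self.buf = None
            self.words.append((word, self.pos, self.pos + len(word)))
            self._emit(word + " ")  # word plus trailing space
        elif cls in ("ocr_line", "ocr_par"):
            self._emit("\n")  # newline after each line and each paragraph


def hocr_to_text(hocr):
    """Return (text, words) for an hOCR string."""
    parser = HocrText()
    parser.feed(hocr)
    parser.close()
    return "".join(parser.parts), parser.words
```

Each `(word, start, stop)` tuple indexes into the returned text, so `text[start:stop]` gives back the word, which is the same invariant the `start` and `stop` attributes are meant to carry in the Ruby version.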