documentcloud/docsplit

Detect page orientation and rotate when necessary

lukerosiak opened this issue · 5 comments

Just add this flag to the tesseract command line call and results for pages with vertical text will be perfect instead of failing completely.

tesseract [other args] -psm 1
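
For anyone wiring this up from Ruby, here is a minimal sketch of how the flag could be appended to the command. It is not Docsplit's actual code: `tesseract_command`, its parameters, and the file names are hypothetical, and it assumes the Tesseract 3.x argument order (`tesseract imagename outputbase [-l lang] [-psm pagesegmode]`). Newer Tesseract releases (4.x and later) spell the flag `--psm`.

```ruby
# Hypothetical helper, not Docsplit's implementation: builds the argument
# list for a Tesseract 3.x call with page segmentation mode 1
# (automatic page segmentation with orientation and script detection).
def tesseract_command(image_path, output_base, language: "eng", psm: 1)
  # Tesseract 3.x expects: tesseract imagename outputbase [-l lang] [-psm mode]
  ["tesseract", image_path, output_base, "-l", language, "-psm", psm.to_s]
end

cmd = tesseract_command("page_1.tif", "page_1")
puts cmd.join(" ")

# To actually run it (requires tesseract on the PATH):
#   system(*cmd)
```

Passing the arguments as an array to `system` avoids shell quoting problems with file names that contain spaces.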

This would be great to add. Docs for what the options do are available here:

http://tesseract-ocr.googlecode.com/svn-history/r725/trunk/doc/tesseract.1.html

Bump because there's some traffic on NICAR-L about this.

I've had intermittent issues with 3.02, where tesseract segfaults when certain flags are set, but 3.03 may be more reliable. I've also done some playing around w/ ruby-ocr to get programmatic access to the confidence scores for the various blocks of text on a page (which we don't have access to via the command-line interface that Docsplit uses).

We still should definitely do this. Pull requests absolutely welcome & encouraged.

cc/ @AbeHandler @jsfenfen @zstumgoren

Maybe also a button in the UI that allows you to rotate a contract 90 degrees? Then you could fix these manually in the UI too if need be.

Alright, just cut a release for the -psm 1 mode. Woulda been earlier but my ISP was out for about 3 hrs.

Also thanks to @AbeHandler for getting the initial spike in.

Nice!

On Nov 17, 2014, at 2:42 AM, Ted Han notifications@github.com wrote:

Alright, just cut a release for the -psm 1 mode. Woulda been earlier but my ISP was out for about 3 hrs.

