Project: OCR (Optical Character Recognition)

Full Project Description

Term: Fall 2018
Team #8
Team members
- Bai, Ruoxi rb3313
- Loewenstein, Oded orl2108
- Yan, Jiaming jy2882
- Zhong, Qingyang qz2317 (presenter)
- Zhu, Siyu sz2716
Paper: D1 + C2

Project summary:

In this project, we created an OCR post-processing procedure to enhance Tesseract OCR output.

And here is our steps:
1. Preprocess the data, manually trimmed ground truth since there are 13 pairs of Tesseract and ground truth files that do not have the same number of lines.
1. Error Detection, use rule based method from paper D1.
1. Error Correction, first calculate 6 features scoring for each candidate based on assigned paper C2, then use AdaBoost.R2 model on top of decision trees with 0-1 loss function. Generate a prediction of top 3 best results as correction.
1. Evaluate detection preformance by calculating precision and recall for word-level. Then construct a confusion matrix.
1. Evaluated correction performance by calculating precision and recall for both word-level and character-level. Then calculate Top 3 candidate coverage.
1. Evaluate the algorithm as a whole.

Functions:

Four functions were implemented for different purposes of this project.
Detailed descriptions can be found here.

Contribution statement:

All team members contributed equally in all stages of this project. All team members approve our work presented in this GitHub repository including this contributions statement.

Bai, Ruoxi: Organized the structure of the whole project; Data preprocessing(Part 1&2); Error Detection; Regression Model; Evaluation(Detection & Correction) and visualization; Debugging; Code readibility.
Loewenstein, Oded: Feature: Levenshtein edit distance & feature: Lexicon existance; Tune parameters from regression model and visualization.
Yan, Jiaming: Feature: String similarity & feature: Language popularity.
Zhong, Qingyang: Data preprocessing(Part 1); Error Detection; Evaluation(Detection) and visualization; Combine all codes together; README; Code readibility; Presentation.
Zhu, Siyu: Candidate search; N-gram function; Feature: Exact-context popularity & feature: Relaxed-context popularity; Combine 6 features together.

Github Organization:

Following suggestions by RICH FITZJOHN (@richfitz). This folder is orgarnized as follows.

proj/
├── lib/
├── data/
├── doc/
├── figs/
└── output/

Please see each subfolder for a README file.

TZstatsADS/Fall2018-Project4-sec2-grp8

Project: OCR (Optical Character Recognition)

Full Project Description