-
Term: Fall 2018
-
Team #8
-
Team members
- Bai, Ruoxi rb3313
- Loewenstein, Oded orl2108
- Yan, Jiaming jy2882
- Zhong, Qingyang qz2317 (presenter)
- Zhu, Siyu sz2716
-
Paper: D1 + C2
Project summary:
-
In this project, we created an OCR post-processing procedure to enhance Tesseract OCR output.
And here is our steps:
-
- Preprocess the data, manually trimmed ground truth since there are 13 pairs of Tesseract and ground truth files that do not have the same number of lines.
-
- Error Detection, use rule based method from paper D1.
-
- Error Correction, first calculate 6 features scoring for each candidate based on assigned paper C2, then use AdaBoost.R2 model on top of decision trees with 0-1 loss function. Generate a prediction of top 3 best results as correction.
Functions:
- Four functions were implemented for different purposes of this project.
- Detailed descriptions can be found here.
Contribution statement:
All team members contributed equally in all stages of this project. All team members approve our work presented in this GitHub repository including this contributions statement.
- Bai, Ruoxi: Organized the structure of the whole project; Data preprocessing(Part 1&2); Error Detection; Regression Model; Evaluation(Detection & Correction) and visualization; Debugging; Code readibility.
- Loewenstein, Oded: Feature: Levenshtein edit distance & feature: Lexicon existance; Tune parameters from regression model and visualization.
- Yan, Jiaming: Feature: String similarity & feature: Language popularity.
- Zhong, Qingyang: Data preprocessing(Part 1); Error Detection; Evaluation(Detection) and visualization; Combine all codes together; README; Code readibility; Presentation.
- Zhu, Siyu: Candidate search; N-gram function; Feature: Exact-context popularity & feature: Relaxed-context popularity; Combine 6 features together.
Github Organization:
Following suggestions by RICH FITZJOHN (@richfitz). This folder is orgarnized as follows.
proj/
├── lib/
├── data/
├── doc/
├── figs/
└── output/
Please see each subfolder for a README file.