Project: OCR (Optical Character Recognition)

Term: Fall 2018

Team # 3
Team members
- Samuel Kolins (sk3651)
- Sheng Wang (sw3224)
- Jiaxi Wu (jw3588)
- Yan Wang (yw3177)
- Wanyi Zheng (wz2409)
Paper: C1 + D3
Project summary: In this project, we created an OCR post-processing procedure to enhance Tesseract OCR output. The first step is detecting garbage in the OCR output: we extract features from each token, then build and train SVM models to predict a garbage label for every token. For the error correction part, among the errors that are detectable, the set of binary digrams can be used to attempt correction.
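
To illustrate the correction step, here is a minimal Python sketch of positional binary digram checking. It is not the project's code (see `lib/` for that). The function names `build_digrams`, `missing_pairs`, and `suggest`, the toy lexicon, and the single-substitution strategy are all assumptions made for this example.

```python
from itertools import combinations
import string


def build_digrams(lexicon):
    """Collect every (length, i, j, char_i, char_j) tuple seen in the lexicon."""
    digrams = set()
    for word in lexicon:
        w = word.lower()
        for i, j in combinations(range(len(w)), 2):
            digrams.add((len(w), i, j, w[i], w[j]))
    return digrams


def missing_pairs(token, digrams):
    """Return the position pairs of the token whose digram never occurs in the lexicon."""
    t = token.lower()
    return [
        (i, j)
        for i, j in combinations(range(len(t)), 2)
        if (len(t), i, j, t[i], t[j]) not in digrams
    ]


def suggest(token, digrams):
    """Propose single-letter substitutions that make every digram valid.

    Assumption for this sketch: the error is one substituted letter, so the
    faulty position must appear in every missing pair.
    """
    t = token.lower()
    bad = missing_pairs(t, digrams)
    if not bad:
        return [t]  # nothing detectable, keep the token as is
    suspects = set(bad[0])
    for pair in bad[1:]:
        suspects &= set(pair)
    candidates = []
    for pos in sorted(suspects):
        for ch in string.ascii_lowercase:
            cand = t[:pos] + ch + t[pos + 1:]
            if not missing_pairs(cand, digrams):
                candidates.append(cand)
    return candidates


# Toy lexicon, purely illustrative.
digrams = build_digrams(["cat", "car", "cot", "dog"])
print(suggest("cxt", digrams))
```

A candidate that passes every digram check is not guaranteed to be a real word, so in practice the suggestions would still need to be ranked or filtered against a dictionary.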

Contribution statement:

Following suggestions by RICH FITZJOHN (@richfitz), this folder is organized as follows.

proj/
├── lib/
├── data/
├── doc/
├── figs/
└── output/

Please see each subfolder for a README file.

TZstatsADS/Fall2018-Project4-sec2-grp3