/Fall2018-Project4-sec2-grp3

Fall2018-Project4-sec2--sec2proj4_grp3 created by GitHub Classroom

Primary LanguageR

Project: OCR (Optical Character Recognition)

image

Term: Fall 2018

  • Team # 3

  • Team members

    • Samuel Kolins (sk3651)
    • Sheng Wang (sw3224)
    • Jiaxi Wu (jw3588)
    • Yan Wang (yw3177)
    • Wanyi Zheng (wz2409)
  • Paper: C1 + D3

  • Project summary: In this project, we created an OCR post-processing procedure to enhance Tesseract OCR output. The fisrt step is detecting garbage out from the OCR output by extracting features, also building and training SVM models to predict the garbage label for all tokens. For the error correction part, among those errors that detectable, the set of binary digrams can be used to attempt correction.

Contribution statement:

  • Error detection: Wanyi Zheng(main), Samuel Kolins, Yan Wang and Sheng Wang
  • Error correction: Jiaxi Wu
  • Evaluation: Wanyi Zheng(main), Jiaxi Wu(main), Sheng Wang and Samuel Kolins
  • Presentation: Samuel Kolins

Following suggestions by RICH FITZJOHN (@richfitz). This folder is orgarnized as follows.

proj/
├── lib/
├── data/
├── doc/
├── figs/
└── output/

Please see each subfolder for a README file.