The goal of rmgarbage is to remove garbage strings from text produced by OCR engines. It contains functions that implement the methods described by:
- Taghva et al. (2001) “Automatic Removal of Garbage Strings in OCR Text: An implementation” http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.81.8901
- Yang Cai (2008) “OCR Output Enhancement” https://ladyissy.github.io/OCR/
The code was inspired by Python code at https://github.com/foodoh/rmgarbage and JavaScript code at https://github.com/Amoki/rmgarbage.
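To give a flavour of the kind of rule these methods rely on, here is a minimal base R sketch of two rules usually attributed to Taghva et al.: a token longer than 40 characters is garbage, and a token with four identical characters in a row is garbage. The function names `rule_l`, `rule_r` and `is_garbage_sketch` are hypothetical and are not part of rmgarbage. The package applies a larger rule set, so its results will differ from this sketch.

```r
# Illustrative sketch only: NOT the rmgarbage implementation.
# All function names below are hypothetical.

# Length rule: a token longer than 40 characters is treated as garbage.
rule_l <- function(token, max_chars = 40) {
  nchar(token) > max_chars
}

# Repetition rule: four identical characters in a row mark a token as garbage.
rule_r <- function(token) {
  grepl("(.)\\1{3}", token, perl = TRUE)
}

# A token is flagged if either rule fires.
is_garbage_sketch <- function(token) {
  rule_l(token) || rule_r(token)
}

tokens <- c("scanned", "scnnnnd", strrep("a1", 25))
flags <- vapply(tokens, is_garbage_sketch, logical(1), USE.NAMES = FALSE)
print(flags)
#> [1] FALSE  TRUE  TRUE
```

Each rule is a small predicate on a single token, which is why the examples further down apply the check token by token.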
You can install rmgarbage from GitHub with:
remotes::install_github("benmarwick/rmgarbage")
This basic example shows how to identify bad OCR output:
library(rmgarbage)
Here is an example of the output for good OCR text:
good_ocr <- "This document was scanned perfectly"
good_ocr_split <- strsplit(good_ocr, " ")[[1]]
sapply(good_ocr_split, rmgarbage)
#> This document was scanned perfectly
#> FALSE FALSE FALSE FALSE FALSE
And here is an example of the output for bad OCR text:
bad_ocr <- "This 3ccm@nt w&s scnnnnd not pe&;c1!y"
bad_ocr_split <- strsplit(bad_ocr, " ")[[1]]
sapply(bad_ocr_split, rmgarbage)
#> This 3ccm@nt w&s scnnnnd not pe&;c1!y
#> FALSE TRUE TRUE TRUE FALSE TRUE
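The logical vector can then be used to drop the flagged tokens. The following is a minimal sketch, assuming a hypothetical helper `drop_garbage()` that accepts any predicate returning `TRUE` or `FALSE` for a single token. To keep the block self-contained it uses a crude stand-in predicate, `has_symbol()`, which is not the package's rule set.

```r
# Hypothetical helper: keep only the tokens that `is_garbage` does not flag.
# `is_garbage` is any function taking one string and returning TRUE or FALSE.
drop_garbage <- function(text, is_garbage) {
  tokens <- strsplit(text, " ", fixed = TRUE)[[1]]
  flagged <- vapply(tokens, is_garbage, logical(1), USE.NAMES = FALSE)
  paste(tokens[!flagged], collapse = " ")
}

# Stand-in predicate for illustration only: flags any token that contains
# a character other than a letter or digit.
has_symbol <- function(token) {
  grepl("[^[:alnum:]]", token)
}

cleaned <- drop_garbage("This 3ccm@nt w&s scnnnnd not pe&;c1!y", has_symbol)
print(cleaned)
#> [1] "This scnnnnd not"
```

With the package loaded, passing `rmgarbage` as the `is_garbage` argument should work in the same way, since the example above already applies it one token at a time. It would also be expected to drop `scnnnnd`, which the stand-in predicate misses.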
If you would like to contribute to this project, please start by reading our Guide to Contributing. Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.