Images to Text: A Gentle Introduction to Optical Character Recognition with PyTesseract

Description

A 2021 Text Analysis Pedagogy Institute course.

Instructor: Hannah Jacobs

This course will introduce the concept of “Optical Character Recognition” (OCR), various tools available for performing OCR, and important considerations for successfully OCRing digitized text. Using Tesseract in Python, we’ll walk through the entire process using a variety of examples to show the range of challenges scholars can face when performing OCR. By the end of the course, participants should be able to use the course’s Jupyter Notebooks to perform OCR on their own; they should be able to identify possible technical challenges presented by specific texts and propose potential solutions; and they should be able to assess the degree of accuracy they have achieved in performing OCR.
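As a rough illustration of the workflow described above, here is a minimal sketch, not the course's own notebook code. It assumes the `pytesseract` and `Pillow` packages and the Tesseract engine are installed. The helper names `ocr_image`, `levenshtein`, and `character_error_rate` are hypothetical, and character error rate is just one common way to assess OCR accuracy, chosen here as an assumption.

```python
def ocr_image(path, lang="eng"):
    """Run Tesseract on one image file and return the recognized text.

    Hypothetical helper: the imports live inside the function so the
    accuracy code below still runs where Tesseract is not installed.
    """
    from PIL import Image
    import pytesseract

    with Image.open(path) as img:
        return pytesseract.image_to_string(img, lang=lang)


def levenshtein(a, b):
    """Count the single-character edits needed to turn string a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the hand-checked reference."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Compare a hand-corrected transcription with a made-up OCR result.
reference = "Optical Character Recognition"
hypothesis = "0ptical Character Recogniti0n"
cer = character_error_rate(reference, hypothesis)
print(f"Character error rate: {cer:.1%}")
```

In practice, `hypothesis` would come from something like `ocr_image("page.png")`, where `page.png` is a placeholder path, and `reference` from a transcription you have checked by hand. A lower rate means the OCR output is closer to the reference.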

Land Acknowledgment

These materials were prepared and are presented on the ancestral homelands of the Haliwa-Saponi, Sappony, and Occaneechi Band of the Saponi nations, whose lands are now known as Durham, North Carolina. This acknowledgment reminds us of the significance of place even in a virtual space, and of our ongoing need to build a more inclusive and equitable society.

Learn more about land acknowledgments. Learn about the Occaneechi Band of the Saponi Nation Homeland Preservation project.

Lessons

License

These materials are licensed under a Creative Commons Attribution (CC BY) license. You are free to share and adapt the materials for your own teaching so long as you credit the creators, label the material with a CC BY license, and indicate if changes were made.

Citation

Use the following text with specific lessons replacing the bracketed phrases:

This lesson is based on [Lesson name] and [repository link] from the 2021 Text Analysis Pedagogy Institute, CC BY, [Instructor-First-Name] [Instructor-Last-Name].

nkelber/tapi_2021_ocr