/locr

Fetches OCRed text for Library of Congress items.

Primary LanguagePythonCreative Commons Zero v1.0 UniversalCC0-1.0

README

This fetches full text from Library of Congress OCR files for LOC items. It returns the text, when found, and None otherwise.

Usage

It can take as input either a result item from a JSON API response or the URL of an item:

from locr import Fetcher

# From item or resource URL
Fetcher.full_text_from_url('https://www.loc.gov/resource/mss85943.001811/')

# From search result
# See https://libraryofcongress.github.io/data-exploration/requests.html
url = 'https://www.loc.gov/search/?fo=json&fa=subject:cats'
response = requests.get(url)
Fetcher(response['results'][0]).full_text()

Note that the above example is not guaranteed to work. In particular, not all objects have online text available.

Fetcher may raise the following exceptions:

  • ObjectNotOnline: when the object does not have any online formats.
  • AmbiguousText: when multiple fulltext options are found.
  • UnknownFormat: when locr is not sure how to handle the fulltext link's filetype.

If you encounter these exceptions, kindly file an issue or open a PR about the newly discovered edge case. Thanks.

Why LOCR?

The Library of Congress has put OCRed full text online for many of its items. However:

  • the API does not in general return the URLs to these items
  • OCRed text exists on different servers, with different URL formats; there is not one single way to construct the relevant URL for an item

While full text is easy to retrieve via the web site for a single item, perhaps you, like me, would like to fetch it programmatically.

Development

This package has a humiliating lack of tests, and I have done nothing to verify appropriate versions for dependencies. It really can use your help. PRs welcome.