DownLoad Full Text

A tool to get the full text of a multipage item in Harvard Digital Collections

Description

In version 1 of Harvard Digital Collections, users who want the full OCR text of a multipage item have to copy it from a popup in the Mirador viewer, which takes multiple clicks and is time-consuming for more than a few pages.

dlft is a proof of concept for single-click full-text download. The script takes the URL of a Harvard Digital Collections (HDC) item page as input. After locating the Digital Repository Service (DRS) identifier on the HDC page, the script calls the endpoint https://pds.lib.harvard.edu/pds/get/ to get the text of each page, then concatenates the page texts into a single TXT file, saved in /Results/.
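
The steps above can be sketched roughly as follows. This is a minimal illustration, not the code in dlft.py: the function names, the 'drs:<digits>' pattern used to spot the identifier, and the output file name are all assumptions. The function that actually requests a page from the endpoint is passed in as fetch_page, because the exact request format is not described here.

```python
import re
from pathlib import Path
from typing import Callable, Iterable


def find_drs_id(html: str) -> str:
    """Pull a DRS identifier out of the item page HTML.

    ASSUMPTION: the identifier appears as 'drs:<digits>' in the page source;
    the real markup may differ.
    """
    match = re.search(r"drs:(\d+)", html)
    if match is None:
        raise ValueError("no DRS identifier found in page")
    return match.group(1)


def collect_text(drs_id: str, pages: Iterable[int],
                 fetch_page: Callable[[str, int], str]) -> str:
    """Fetch each page's text in order and join the pieces with blank lines.

    fetch_page is a hypothetical callable (drs_id, page_number) -> text that
    wraps the call to the page delivery endpoint.
    """
    return "\n\n".join(fetch_page(drs_id, n).strip() for n in pages)


def save_text(text: str, drs_id: str, out_dir: str = "Results") -> Path:
    """Write the combined text to <out_dir>/<drs_id>.txt (file name is an assumption)."""
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    target = folder / f"{drs_id}.txt"
    target.write_text(text, encoding="utf-8")
    return target
```

Keeping the network call behind fetch_page means the joining and saving logic can be tried out with a stand-in function before pointing it at the real service.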

Dependencies

The program was written with Python 3.7.3 and needs the following modules installed in the environment where it runs:

requests
bs4
tqdm

To install Python modules, use pip at a bash console: pip install <name of module>. Note that the bs4 module is published on PyPI as beautifulsoup4.

How to run the program

Make sure the external modules listed above are installed.
Change the value of HDC_url at the top of dlft.py to the desired URL. Example:

HDC_url = 'https://digitalcollections.library.harvard.edu/catalog/990043816950203941'

Run the script in a bash console: python dlft.py
Wait for the script to finish. Run time depends on the length of the item; the page delivery service seems to return about 2-3 pages per second.
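
As a rough guide to how long the wait might be, the observed rate of 2-3 pages per second can be turned into a time estimate. The helper below is only an illustration; its name and defaults are assumptions, and the real rate will vary with the service and the network.

```python
def estimate_seconds(page_count, pages_per_second=(2.0, 3.0)):
    """Return (fastest, slowest) expected run time in seconds.

    ASSUMPTION: throughput stays within the 2-3 pages per second
    range observed for the page delivery service.
    """
    slow_rate, fast_rate = pages_per_second
    return page_count / fast_rate, page_count / slow_rate


fastest, slowest = estimate_seconds(300)
print(f"300 pages: about {fastest:.0f} to {slowest:.0f} seconds")
```

For a 300-page item this works out to somewhere between 100 and 150 seconds.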

Options

You can limit the download to a specific page range. Set manual_pagination to True, then set manual_page_start and manual_page_end to the first and last pages you want:

manual_pagination = False
manual_page_start = 1
manual_page_end = 11
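
One way these three settings could be turned into the list of pages to fetch is sketched below. This is not taken from dlft.py: the function name is hypothetical, and it assumes pages are numbered from 1 and that manual_page_end is included in the range.

```python
def resolve_pages(total_pages, manual_pagination=False,
                  manual_page_start=1, manual_page_end=11):
    """Return the page numbers to fetch as a range.

    ASSUMPTIONS: pages are 1-based and manual_page_end is inclusive.
    """
    if not manual_pagination:
        # Default behaviour: every page of the item.
        return range(1, total_pages + 1)
    # Clamp the requested range to the pages that actually exist.
    start = max(1, manual_page_start)
    end = min(total_pages, manual_page_end)
    if start > end:
        raise ValueError("page range is empty for this item")
    return range(start, end + 1)
```

Clamping to the item's real page count means an overly large manual_page_end simply stops at the last page instead of requesting pages that do not exist.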

Future development