This repo contains the code for a two-step process:
doc_scraper.py
- pulls basic document and page-level data from the voyages API
- saves this into a file, documents_pages.json
transkribus_pusher.py
- parses the documents_pages.json file
- for each document, it then
  - creates an upload job in Transkribus
  - for each page in that document, it then
    - downloads the full-sized JPEG pointed at in the JSON file
    - saves it as {{primary_key}}.jpg
    - pushes the JPG up to the Transkribus server
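
The per-page download step listed above can be sketched roughly as follows. This is a minimal illustration, not the code in transkribus_pusher.py: the key names `pk` and `image_url` and the helpers `fetch_bytes` and `save_page_image` are hypothetical, since the actual layout of documents_pages.json is not described here.

```python
from pathlib import Path
from urllib.request import urlopen


def fetch_bytes(url):
    """Download the raw bytes behind a URL."""
    with urlopen(url) as response:
        return response.read()


def save_page_image(page, out_dir, fetch=fetch_bytes):
    """Download one page image and save it as <primary key>.jpg.

    `page` is assumed to be a dict with a primary key under "pk" and the
    full-sized JPEG URL under "image_url" (both key names are assumptions).
    """
    target = Path(out_dir) / f"{page['pk']}.jpg"
    target.write_bytes(fetch(page["image_url"]))
    return target
```

Passing the fetch function in as a parameter keeps the file-naming logic testable without touching the network.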
The uploads were taking a long time, so I built in multithreading. Within each document, the pages are uploaded in parallel, but the documents themselves are handled serially.
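
A minimal sketch of that pattern, assuming a thread pool from the standard library: documents are walked one at a time, and each document's pages are handed to a pool of workers. The function name `push_documents`, the `upload_page` callable, and the `shortref` and `pages` keys are hypothetical stand-ins, not the script's actual names.

```python
from concurrent.futures import ThreadPoolExecutor


def push_documents(documents, upload_page, workers=1):
    """Process documents serially; upload each document's pages in parallel.

    `documents` is assumed to be a list of dicts with "shortref" and "pages"
    keys, and `upload_page` is any callable that uploads a single page.
    """
    results = {}
    for doc in documents:  # documents are handled one after another
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # pool.map returns results in the same order as the input pages
            results[doc["shortref"]] = list(pool.map(upload_page, doc["pages"]))
    return results
```

Leaving the `with` block waits for every page of the current document to finish before the next document starts, which is what keeps the documents serial.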
You need a credentials.py file to access both Transkribus and the voyages (SV) API.
Transkribus was failing on the large TIFs from the libraries affiliated with the South Seas project.
So we had to push up our JPEG collections in order to run HTR on them.
However, the best source for those JPEGs was still the libraries' IIIF endpoints. We have these pointers in our database, but were only holding the large TIFs locally.
This invocation iterates over all documents in the JSON file and uploads the pages in a single thread:
python transkribus_pusher.py
This invocation specifies a single document and a number of workers:
python transkribus_pusher.py --shortref="DOCP Huntington 57 17" --workers=5
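
For orientation, here is one way the two flags above could be parsed with argparse. It is a sketch under assumptions, not the script's actual argument handling: the defaults (no shortref meaning all documents, one worker) are inferred from the two invocations shown, and `build_parser` and `parse_args` are hypothetical names.

```python
import argparse


def build_parser():
    """Build a parser for the two flags shown in the usage examples."""
    parser = argparse.ArgumentParser(
        description="Push page images to Transkribus"
    )
    # Assumed default: no shortref means "process every document".
    parser.add_argument("--shortref", default=None,
                        help="process only the document with this short reference")
    # Assumed default: one worker, i.e. single-threaded uploads.
    parser.add_argument("--workers", type=int, default=1,
                        help="number of parallel page uploads per document")
    return parser


def parse_args(argv=None):
    """Parse argv (or sys.argv[1:] when argv is None)."""
    return build_parser().parse_args(argv)
```

Because the shortref contains spaces, it has to be quoted on the command line, as in the example above.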