Primary language: Go. License: MIT.

Discovering IIIF Manifests with Pictor

Idea

Discovering IIIF resources can be challenging.

Although IIIF does specify a dedicated Discovery API, institutions rarely implement it. (At Anet we are guilty of the same.) Moreover, this API offers no straightforward way to obtain a full collection. It is certainly not as simple as OAI-PMH, for instance, which offers the verb ListIdentifiers.

The IIIF documentation does have an interesting Guide to finding IIIF resources, which features a list of IIIF collections. Similar sources are:

With that information I was able to scrape several of these collections and aggregate them into a corpus of about 6.5 million IIIF manifests. The resulting lists are available in this repository.

This repository has two purposes. First, it offers a place to store IIIF collections and make them available to others. Second, it uses those collections to host a discovery tool with a sample of them.

Currently, it features manifests of the following institutions / collections:

(* = No sample in the discovery tool yet)

Harvesting

Harvesting the IIIF manifests was done with Python scripts in a variety of ways.

Many institutions, like the Bayerische Staatsbibliothek or the University of Toronto, let you scrape collections from their Presentation API. Others, like Digital Commonwealth, offer an OAI-PMH endpoint that gets you the necessary identifiers. Still others, like the Getty Institute or Wikidata, offer a SPARQL endpoint.
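To illustrate the Presentation API route, here is a minimal Python sketch, not the actual harvesting script. It assumes the collection JSON has already been downloaded and parsed, and it pulls manifest URLs out of both the version 2 layout ("manifests", "collections", "members" with "@id") and the version 3 layout ("items" with "id"). The function name iter_manifest_ids is made up for this example.

```python
def iter_manifest_ids(collection):
    """Yield manifest URLs found in one parsed IIIF collection document."""
    entries = []
    # v2 uses "manifests", "collections" and "members"; v3 uses "items".
    for key in ("manifests", "collections", "members", "items"):
        value = collection.get(key, [])
        if isinstance(value, list):
            entries.extend(value)

    for entry in entries:
        if not isinstance(entry, dict):
            continue
        kind = entry.get("type") or entry.get("@type") or ""
        ident = entry.get("id") or entry.get("@id")
        if not isinstance(kind, str):
            continue
        if kind.endswith("Manifest") and ident:
            yield ident
        elif kind.endswith("Collection"):
            # Only sub-collections embedded inline are followed here.
            # A real harvester would fetch the ones given by URL only.
            yield from iter_manifest_ids(entry)
```

Sub-collections that are merely referenced by URL yield nothing in this sketch, so a full harvest would need an extra fetch step (with retries) around it.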

I harvested all manifests I could find for the repository and also made random samples of each collection for the discovery tool (5,000 manifests at first, 1,000 later; see the update below). For this, good old Unix tools are still amazingly good:

sort digitalcommonwealth.txt | uniq | shuf | head -n 5000 > digitalcommonwealth_sample.txt

(Update 24 November 2022: at this point in the project, I was forced to switch from 5K to 1K samples from these collections, because of the sheer volume of the material. So at the moment the discovery tool's results are somewhat skewed, because some collections are better represented than others. I plan to remedy this in the future with a full re-run, but it will be a while before I get round to this.)
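For anyone who prefers to stay in Python, the shell pipeline above has a rough equivalent. This is only a sketch: the function name sample_manifests and the seed parameter are illustrative assumptions, and the fixed seed is an addition that makes a sample reproducible, which shuf does not do by default.

```python
import random


def sample_manifests(lines, k, seed=None):
    """Deduplicate manifest URLs and return a random sample of at most k."""
    # Equivalent of `sort | uniq`: strip whitespace, drop blanks and duplicates.
    unique = sorted({line.strip() for line in lines if line.strip()})
    rng = random.Random(seed)
    if k >= len(unique):
        # Small collections are returned whole, in random order.
        rng.shuffle(unique)
        return unique
    # Equivalent of `shuf | head -n k`.
    return rng.sample(unique, k)
```

Using the same k for every collection keeps the collections equally represented in the discovery tool.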

Indexing

Requesting and indexing the IIIF manifests was done with a Go script (since Go is really strong at concurrency), and the result was piped as triple statements from stdout to a plain text file. (I found this a handy alternative to setting up a database like PostgreSQL, or something similar that can handle concurrent writes.)
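The pattern itself (many concurrent fetches, one writer emitting triples) can be sketched in a few lines. Note that this is Python rather than the Go used in the project, and that the tab-separated triple layout, the "label" predicate and the function names are assumptions for the example, not the actual format of db.txt.

```python
import sys
from concurrent.futures import ThreadPoolExecutor


def to_triples(url, manifest):
    """Turn one parsed manifest into (subject, predicate, object) tuples."""
    return [(url, "label", str(manifest.get("label", "")))]


def index_all(urls, fetch, out=sys.stdout, workers=8):
    """Fetch manifests concurrently and write triples from a single thread."""
    def work(url):
        try:
            return to_triples(url, fetch(url))
        except Exception:
            # Timeouts, refused connections, invalid JSON: skip the manifest.
            return []

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map hands results back to the calling thread in input order,
        # so only this loop writes and lines are never interleaved.
        for triples in pool.map(work, urls):
            for s, p, o in triples:
                out.write(f"{s}\t{p}\t{o}\n")
```

Here fetch is any callable that takes a URL and returns the parsed manifest, which keeps the sketch independent of a particular HTTP library. Because only one thread writes, no database or lock is needed to handle concurrent output.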

The resulting triple store was then turned into a number of JSON files, including one for the IIIF manifest identifiers and their matching sequential number. I used base-85 numbers for the latter, as this gave me a very efficient way to encode large numbers.
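To show why base 85 is compact, here is a small sketch of encoding sequential numbers that way. The alphabet below is an assumption (the same 85 characters Python's base64.b85encode uses, none of which need escaping inside a JSON string); the alphabet Pictor actually uses may differ.

```python
import string

# 10 digits + 52 letters + 23 punctuation marks = 85 symbols.
ALPHABET = (
    string.digits
    + string.ascii_uppercase
    + string.ascii_lowercase
    + "!#$%&()*+-;<=>?@^_`{|}~"
)
BASE = len(ALPHABET)


def encode(n):
    """Encode a non-negative integer as a base-85 string."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, remainder = divmod(n, BASE)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def decode(text):
    """Decode a base-85 string back into an integer."""
    n = 0
    for char in text:
        n = n * BASE + ALPHABET.index(char)
    return n
```

With this alphabet, encode(80000) gives "B6F": three characters instead of five decimal digits. Every number below 85^3 = 614,125 fits in three characters, and every number below 85^4 = 52,200,625 fits in four.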

In total, this process only takes a couple of hours for the current sample of ca. 80,000 manifests.

Workflow, after harvesting and sampling into *_sample.txt files

mv *_sample.txt ../indexer/corpus
cd indexer
./build.sh
./pictor >> db.txt
python3 jsonify.py

Web application

Finally, a web interface with some JavaScript lets you enter one or several keywords, which are then looked up in the index. The resulting matches are presented as IIIF thumbnails, together with the manifest URL and the label metadata. A random selection of keywords is also offered.
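The lookup boils down to a query against an inverted index. The sketch below shows the idea in Python, although the real tool does this in JavaScript in the browser. The index layout (keyword mapped to a list of manifest numbers) and the choice to require every keyword to match are assumptions made for the example.

```python
def search(index, query):
    """Return the manifest numbers that match every keyword in the query."""
    keywords = query.lower().split()
    if not keywords:
        return []
    # One set of manifest numbers per keyword; unknown keywords match nothing.
    postings = [set(index.get(word, ())) for word in keywords]
    return sorted(set.intersection(*postings))
```

For example, with index = {"psalter": [1, 4, 7], "bruges": [4, 9]}, the query "psalter bruges" returns [4]. The matching numbers would then be resolved to manifest URLs through the identifier file.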

Note that this is a completely serverless application: it hosts the necessary JSON files statically and reads them into browser memory when the page loads. Obviously, this approach has its limitations, but, as with my Ulpia project, the benefits of not having to spin up a server for this tool outweigh the disadvantages.

Technical remarks

  • Not only do institutions seem to neglect IIIF discovery somewhat, several of the APIs I used also suffered frustrating hiccups like timeouts, refused connections or faulty resumption tokens. When I first started working on this, it seemed as if some collections even actively tried to limit scraping or crawling. To be fair, though, there was usually a technical issue, and several of the institutions I contacted replied in a really constructive and helpful way.

  • Parsing IIIF manifests (both version 2 and 3 manifests are current) with Go has taught me that a lot of institutions seem to implement their own interpretation of the API rather than follow the specifications. Mandatory fields are left out, fields have different data formats (strings instead of arrays and such), and so on.

  • I did some experiments with SQLite as a database backend, both for this application and for the requesting/indexing phase. The first, inspired by the recent sqlite3 WASM/JS functionality, I just could not get up and running. The second, I found out, is not a viable option: even if you insert data concurrently from goroutines, SQLite apparently serializes all writes. Not sure about this info, though...
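As an illustration of the second remark above (fields with varying data formats), this is a sketch of how a tolerant parser might flatten the label field. It is written in Python rather than Go, and it only covers shapes I am sure occur: a plain string, a version 2 object with "@value", a list of either, and a version 3 language map. The function name label_strings is made up for the example.

```python
def label_strings(label):
    """Flatten the common IIIF label shapes into a list of strings."""
    if label is None:
        return []
    if isinstance(label, str):
        return [label] if label.strip() else []
    if isinstance(label, list):
        flattened = []
        for item in label:
            flattened.extend(label_strings(item))
        return flattened
    if isinstance(label, dict):
        if "@value" in label:
            # Version 2 language-tagged value: {"@value": ..., "@language": ...}
            return label_strings(label["@value"])
        # Version 3 language map: {"en": ["..."], "none": ["..."]}
        flattened = []
        for values in label.values():
            flattened.extend(label_strings(values))
        return flattened
    # Numbers and other surprises are kept as text rather than dropped.
    return [str(label)]
```

The recursion means that a language map whose values are bare strings instead of arrays is still handled.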

Wild plan and call to action

Finally, some daydreaming. I made the discovery tool for a sample of the manifests I have collected, but what I would really like to do is push the limits and see how many manifests I can process and still host the index on a static webpage. Currently, for ca. 80,000 manifests, the JSON files are only slightly above 25 MB in total, so this could definitely be scaled up.

So if you or your institution want to participate in this experiment, or simply deposit your IIIF manifests in the central repository, please get in touch with me.

See also

A very similar initiative to Pictor is the Simple IIIF Discovery by the National Gallery.

Acknowledgements

Since first publishing this project, many people have reached out with kind comments and useful suggestions. As a result, Pictor has become a better and more comprehensive tool!

Special thanks go to Etienne Posthumus, Bob Coret, Alexander Winkler, Glen Robson, Jules Schoonman, Johannes Baiter, Eduardo Fernández, Mek, Jörg Lehmann, Jolan Wuyts and anyone else I might forget...