
[ICDAR 2023] EEBO-Verse: Sifting for Poetry in Large Early Modern Corpora Using Visual Features



This repository hosts the code and data for the paper: EEBO-Verse: Sifting for Poetry in Large Early Modern Corpora Using Visual Features.

If you are interested in our project and would like our help, please feel free to contact us at danlu@ucsd.edu or open an issue in this repository!


The data has two parts: the images and the annotations. Because of copyright restrictions, we cannot publicly share the original-resolution EEBO image dump. However, if you are affiliated with a university or institution that subscribes to EEBO (Early English Books Online), please contact me directly or ask your library for a copy of the image dump. The current list of subscribing institutions can be found at https://www.proquest.com/athenslogin#.

We provide some helper scripts in data_scripts that process the XML and make it available for downstream tasks such as classification or OCR.

TCP raw data

According to EEBO-TCP's navigation page, you can:

  • download the individual XML file for each book from here
  • download the whole set from the public folder via Dropbox
Some tips on parsing the XML
  • The STC tag, for example <STC T="S">10558.5</STC>, holds the STC catalog number of the collection; you can search for 10558.5 on EEBO online to view the images and text.
  • The volume id, for example <VID>8932</VID>, is the folder name of the collection in EEBO's disk dump.
  • The EEBO citation ID, for example <idno type="EEBO-CITATION">13672099</idno>.
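
The three identifiers above can be pulled out of a header with the Python standard library. The following is only a minimal sketch: the function name extract_ids, the SAMPLE wrapper element, and the case-insensitive tag matching are assumptions made for illustration, and the real TCP files may nest these tags differently or use namespaces not shown here.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any XML namespace prefix and lowercase the tag name."""
    return tag.rsplit("}", 1)[-1].lower()


def extract_ids(xml_text):
    """Collect the STC number, volume id, and EEBO citation id, if present."""
    root = ET.fromstring(xml_text)
    ids = {"stc": None, "vid": None, "eebo_citation": None}
    for elem in root.iter():
        name = local_name(elem.tag)
        text = (elem.text or "").strip()
        if name == "stc" and ids["stc"] is None:
            ids["stc"] = text
        elif name == "vid" and ids["vid"] is None:
            ids["vid"] = text
        elif name == "idno" and ids["eebo_citation"] is None:
            # Attribute names may be upper or lower case depending on the file.
            attrs = {key.lower(): value for key, value in elem.attrib.items()}
            if attrs.get("type", "").upper() == "EEBO-CITATION":
                ids["eebo_citation"] = text
    return ids


# Hypothetical fragment assembled from the example tags in the tips above.
SAMPLE = """<HEADER>
  <STC T="S">10558.5</STC>
  <VID>8932</VID>
  <idno type="EEBO-CITATION">13672099</idno>
</HEADER>"""

if __name__ == "__main__":
    print(extract_ids(SAMPLE))
```

Only the first occurrence of each identifier is kept; a header that lists several STC numbers would need a list instead of a single value.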

For more information, please refer to TCP's FAQ page.

How to obtain images from a local CD-disk version of the EEBO dumps?

If you also have the CD-disk version of the EEBO dumps, you will find the citation ID in the Disk*.xml files. Please contact your university library to obtain the dumps. An example script can be found in data_scripts/lookup_disk.py.

<AUTH>Bernard, of Clairvaux, Saint, 1090 or 91-1153.</AUTH>
<TITLE>A looking-glass for all new-converts to whatsoever perswasion</TITLE>

With the IMAGE_ID, you can locate the images of the collection on disk. If you don't have a local copy, you can also visit the URL to download the file from ProQuest.
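
As a rough illustration of the lookup step, the sketch below scans every Disk*.xml file in a directory for a citation ID using a plain substring search. It is not the logic of data_scripts/lookup_disk.py: the function name find_citation and the line-by-line text scan are assumptions, chosen because they make no claim about the internal structure of the dump files.

```python
from pathlib import Path


def find_citation(dump_dir, citation_id):
    """Return (file name, line number, line) for every line mentioning the ID."""
    hits = []
    for path in sorted(Path(dump_dir).glob("Disk*.xml")):
        # errors="replace" keeps the scan going if a dump is not clean UTF-8.
        with open(path, encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if citation_id in line:
                    hits.append((path.name, number, line.strip()))
    return hits
```

For example, find_citation("/path/to/dump", "13672099") lists the records to inspect by hand. Because this is a substring match, a longer ID that happens to contain the same digits would also be reported, so the hits should be checked before use.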

Poetry text

For humanities researchers who are only interested in the poetry text, we provide a simple text dump here, via Google Drive. The script that processes the XML files to obtain poetry-only text is available here.
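
For readers who want a feel for this kind of extraction, here is a minimal sketch, not the script linked above. It assumes verse lines are marked up with the TEI line element (written L or l) and simply collects their text; the function name verse_lines and the SAMPLE fragment are invented for illustration.

```python
import xml.etree.ElementTree as ET


def verse_lines(xml_text):
    """Return the text of every verse-line element, one string per line."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        # Ignore any namespace and match the tag name case-insensitively.
        if elem.tag.rsplit("}", 1)[-1].lower() == "l":
            # itertext() also picks up text inside nested markup such as <HI>.
            text = " ".join("".join(elem.itertext()).split())
            if text:
                lines.append(text)
    return lines


# Invented fragment: one prose paragraph followed by a two-line verse group.
SAMPLE = """<TEXT>
  <P>A prose paragraph that should be skipped.</P>
  <LG>
    <L>First line of <HI>verse</HI>,</L>
    <L>Second line of verse.</L>
  </LG>
</TEXT>"""

if __name__ == "__main__":
    for line in verse_lines(SAMPLE):
        print(line)
```

Whitespace inside each line is collapsed so that line breaks in the XML source do not leak into the output.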

We will release the poetry text we detected in the unannotated portion of EEBO later this year.

Citing the paper

If you find this repository helpful for your research, please cite our paper and/or the EEBO/TCP project:

  title={EEBO-Verse: Sifting for Poetry in Large Early Modern Corpora Using Visual Features},
  author={Chen, Danlu and Jiang, Nan and Berg-Kirkpatrick, Taylor},
  booktitle={International Conference on Document Analysis and Recognition},