This repository contains code to analyse historical books and newspapers datasets using Apache Spark.
This repository is an improved version of defoe.
Defoe already supports several datasets. In order to query a dataset, defoe needs a list of the files and/or directories that make up the dataset. Many of the files used so far can be found under the others directory. Those files need to be modified to point at the corresponding paths on your system.
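For example, a data file for the British Library Books dataset is a plain-text file listing one file or directory per line (see Specify data to query below). The directory paths here are hypothetical; the ZIP name is taken from the dataset description below:

```
/mnt/data/blbooks/1880-1889/000000037_0_1-42pgs__944211_dat.zip
/mnt/data/blbooks/1850-1859/
```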
This dataset consists of ~1TB of digitised versions of ~68,000 books from the 16th to the 19th centuries. The books have been scanned into a collection of XML documents. Each book has one XML document per page, plus one XML document for metadata about the book as a whole. The XML documents for each book are held within a compressed ZIP file. Each ZIP file holds the XML documents for a single book (the exception is 1880-1889's 000000037_0_1-42pgs__944211_dat.zip, which holds the XML documents for 2 books). These ZIP files occupy ~224GB.
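A minimal sketch of inspecting one of these book ZIPs with Python's standard library, to see the per-page XML documents plus the book-level metadata document; the local path is an assumption:

```python
import zipfile

# Hypothetical local copy of one book's ZIP file.
ZIP_PATH = "000000037_0_1-42pgs__944211_dat.zip"

with zipfile.ZipFile(ZIP_PATH) as zf:
    # Each member should be an XML document: one per page,
    # plus one metadata document for the whole book.
    xml_members = [m for m in zf.namelist() if m.lower().endswith(".xml")]
    print(f"{len(xml_members)} XML documents in {ZIP_PATH}")
    for member in xml_members[:5]:  # peek at the first few entries
        print(member)
```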
This dataset is available under an open, public domain licence. See Datasets for content mining and BL Labs Flickr Data: Book data and tag history (Dec 2013 - Dec 2014). For links to the data itself, see Digitised Books largely from the 19th Century. The data is provided by Gale, a division of CENGAGE.
This dataset consists of ~1TB of digitised versions of newspapers from the 18th to the early 20th century. Each newspaper has an associated folder of XML documents where each XML document corresponds to a single issue of the newspaper. Each XML document conforms to a British Library-specific XML schema.
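A minimal sketch of walking one newspaper's folder and parsing each issue document with lxml; the folder name is hypothetical and no assumptions are made about the BL-specific schema beyond well-formed XML:

```python
from pathlib import Path
from lxml import etree

# Hypothetical folder for one newspaper; each XML file is one issue.
paper_dir = Path("0000164_Example_Herald")

for issue_file in sorted(paper_dir.glob("*.xml")):
    root = etree.parse(str(issue_file)).getroot()
    print(issue_file.name, "-> root element:", root.tag)
```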
This dataset is available, under licence, from Gale, a division of CENGAGE. The dataset is in 5 parts, e.g. Part I: 1800-1900. For links to all 5 parts, see British Library Newspapers.
The code can also handle the Times Digital Archive (TDA).
This dataset is available, under licence, from Gale, a division of CENGAGE.
The code has been used with papers from 1785 to 2009.
This dataset is available, under licence, from Find My Past. To run queries with this dataset we can choose to use either of the following models (see the sketch after this list):
- ALTO model: for running queries at page level. These are the same queries as for the BL books.
- FMP model: for running queries at article level.
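A hedged sketch of selecting the model when submitting a query, wrapping spark-submit from Python. The run_query argument order follows defoe's usual pattern, but the data file, query module names, and configuration file are placeholders; check defoe/run_query.py and the Available queries section for the exact names:

```python
import subprocess

def run_defoe(model, query, data_file="fmp.txt", config="queries/example.yml"):
    """Submit the same Find My Past data under a given defoe model."""
    subprocess.run([
        "spark-submit", "--py-files", "defoe.zip", "defoe/run_query.py",
        data_file, model, query, config, "-r", f"results_{model}.yml",
    ], check=True)

run_defoe("alto", "defoe.alto.queries.total_pages")    # page level
run_defoe("fmp", "defoe.fmp.queries.total_articles")   # article level (placeholder query name)
```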
Papers Past provides digitised New Zealand and Pacific newspapers from the 19th and 20th centuries.
Data can be accessed via API calls which return search results in the form of XML documents. Each XML document holds one or more articles.
This dataset is available, under licence, from Papers Past.
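A hedged sketch of the API pattern described above: fetch search results and parse the returned XML for articles. The endpoint URL, query parameters, and element names are all placeholders, not the real Papers Past API; substitute the details issued with your licence:

```python
import requests
from lxml import etree

API_URL = "https://api.example.org/paperspast/search"  # placeholder endpoint

response = requests.get(API_URL, params={"text": "gold rush", "key": "YOUR_KEY"})
response.raise_for_status()

root = etree.fromstring(response.content)
# Each returned XML document holds one or more articles; the element
# names below are assumptions - inspect a real response to confirm them.
for article in root.iter("article"):
    print(article.findtext("title"))
```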
The National Library of Scotland (NLS) provides several digitised collections, such as:
- Encyclopaedia Britannica from the 18th to the 20th centuries.
- ChapBooks
- Ladies' Edinburgh Debating Society
- Scottish Gazetteers
Note that all collections offered by the NLS use the same XML and METS format. Therefore, we can use the defoe NLS model to query any of these collections.
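A minimal sketch of why one model suffices: every NLS collection pairs a METS document with ALTO page XML, so the same parsing logic applies across collections. The file name is hypothetical; the METS and XLink namespaces are standard:

```python
from lxml import etree

# Hypothetical METS document from any NLS collection.
mets = etree.parse("nls_item_mets.xml")

# List the page files the METS document links to (standard METS/XLink attributes).
for flocat in mets.iter("{http://www.loc.gov/METS/}FLocat"):
    print(flocat.get("{http://www.w3.org/1999/xlink}href"))
```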
Set up (local):
Set up (Urika):
- Set up Urika environment
- Import data into Urika
- Import British Library Books and Newspapers data into Urika (Alan Turing Institute-Scottish Enterprise Data Engineering Program University of Edinburgh project members only)
Set up (Cirrus - HPC Cluster):
Set up (VM):
Run queries:
- Specify data to query
- Specify Azure data to query
- Run individual queries
- Run multiple queries at once - just one ingestion
- Extracting, Transforming and Saving RDD objects to HDFS as a dataframe (see the sketch after this list)
- Loading dataframe from HDFS and performing a query
- Extracting, Transforming and Saving RDD objects to PostgreSQL database
- Loading dataframe from PostgreSQL database and performing a query
- Extracting, Transforming and Saving RDD objects to ElasticSearch
- Loading dataframe from ElasticSearch and performing a query
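The HDFS items above follow an ingest-once, query-many pattern. Here is a hedged PySpark sketch of that pattern, saving extracted records as a dataframe and reloading them for a later query; the paths and column names are illustrative, not defoe's exact schema (the PostgreSQL and ElasticSearch variants swap in the corresponding Spark connectors):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("defoe-hdfs-sketch").getOrCreate()

# Records extracted from a defoe model (illustrative rows and columns).
rows = [("book_0001", 1842, "page text ..."), ("book_0002", 1851, "page text ...")]
df = spark.createDataFrame(rows, ["book_id", "year", "text"])

# Save once to HDFS ...
df.write.mode("overwrite").parquet("hdfs:///user/defoe/books_df")

# ... then reload later and run further queries without re-ingesting the XML.
books = spark.read.parquet("hdfs:///user/defoe/books_df")
books.filter(books.year >= 1850).show()
```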
Available queries:
- ALTO documents (British Library Books and Find My Past Newspapers (at page level))
- British Library Newspapers (these can also be run on the Times Digital Archive)
- FMP newspapers (Find My Past Newspapers datasets at article level)
- Papers Past New Zealand and Pacific newspapers
- Generic XML document queries (these can be run on arbitrary XML documents)
- NLS queries (these can be run on the Encyclopaedia Britannica, Scottish Gazetteers or ChapBooks datasets)
- HDFS queries (running queries against HDFS files - for interoperability across models)
- ES queries (running queries against ES - for interoperability across models)
- PostgreSQL queries (running queries against PostgreSQL database - for interoperability across models)
- NLSArticles query (just for automatically extracting articles from the Encyclopaedia Britannica dataset)
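All of the queries above follow a common module pattern. Here is a hedged sketch of a minimal query module in that style; the do_query signature and the archive/document attributes follow the shipped queries, but check an existing module (e.g. under defoe/nls/queries) for the exact contract:

```python
def do_query(archives, config_file=None, logger=None):
    """Count documents per year across an RDD of archives (sketch).

    `archives` is assumed to be an RDD of archive objects, each iterable
    over document objects carrying a `year` attribute, as in the NLS model.
    """
    documents = archives.flatMap(lambda archive: list(archive))
    counts = documents \
        .map(lambda document: (document.year, 1)) \
        .reduceByKey(lambda x, y: x + y) \
        .collect()
    return {year: num_documents for year, num_documents in counts}
```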
Developers:
The code is called "defoe" after Daniel Defoe, writer, journalist and pamphleteer of the 17th and 18th centuries.