OBVILCorpusImporter

A Python Scrapy spider intended to retrieve .xml and .epub files from OBVIL corpora.

Primary language: Python. License: BSD 2-Clause "Simplified" License (BSD-2-Clause).

This project is intended to ease the mass import of the OBVIL Library into the OBVIL OAI-PMH repository.

What this script does

Once launched with the proper command, for instance
python3 scrap_obvil_corpora.py -s "crawled_data" -c ../configs/config_omeka.json
the script crawls the specified [1] OBVIL corpora available in the OBVIL Library.

It will:

  • save the XML/TEI version of the texts in the specified directory (i.e. "crawled_data");
  • extract the relevant header metadata to be exposed in the OAI-PMH repository (e.g. dc:creator, dc:relation, dc:rights, dc:format, dc:identifier, dc:title, dc:contributor...);
  • create a thumbnail ("vignette") for each document. All the thumbnails have been generated once and are stored here. If some are missing, consider copying them to the server directly with scp, using your admin privileges;
  • build one Omeka CSV import file per specified project, with all the necessary information, in the specified directory (i.e. "crawled_data").

TL;DR:
  • python3 scrap_obvil_corpora.py -s "crawled_data" -c ../configs/config_omeka.json
  • All you need is in the folder crawled_data.
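
To illustrate the metadata extraction and CSV steps described above, here is a minimal sketch using only the Python standard library. It is not the code of this repository: the field-to-path mapping DC_PATHS, the function names extract_dc and to_csv, and the sample document are assumptions made for illustration, and the real script may read other header elements or produce different CSV columns.

```python
import csv
import io
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical mapping from Dublin Core fields to TEI header paths
# (an assumption; the actual spider may use a different mapping).
DC_PATHS = {
    "dc:title": "tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title",
    "dc:creator": "tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:author",
    "dc:rights": "tei:teiHeader/tei:fileDesc/tei:publicationStmt/tei:availability",
}


def extract_dc(tei_xml):
    """Return a dict of Dublin Core fields read from a TEI header."""
    root = ET.fromstring(tei_xml)
    record = {}
    for field, path in DC_PATHS.items():
        # Flatten each matching element to whitespace-normalized text.
        values = [
            " ".join("".join(element.itertext()).split())
            for element in root.findall(path, TEI_NS)
        ]
        record[field] = "; ".join(value for value in values if value)
    return record


def to_csv(records):
    """Serialize the records as CSV text, one row per document."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(DC_PATHS))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Made-up sample document, used only to exercise the functions.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Sample title</title>
        <author>Jane Doe</author>
      </titleStmt>
      <publicationStmt>
        <availability><p>Sample licence</p></availability>
      </publicationStmt>
    </fileDesc>
  </teiHeader>
  <text><body><p>Sample text</p></body></text>
</TEI>"""

if __name__ == "__main__":
    print(to_csv([extract_dc(SAMPLE)]))
```

Fields with several matching elements (for instance, two authors) are joined with "; " in this sketch; whether the Omeka import expects that separator depends on how the CSV mapping is configured.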

What it does not do (i.e. DIY)

To successfully import the documents into the OAI-PMH repository, you will need to:

  • Run this script with the right options and configuration.
  • Put the generated vignettes in the right place on the server if they are missing.
  • Manually import the generated CSV file into Omeka, with proper rights and mappings.
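
Before copying vignettes to the server, it can help to know which documents lack one. The sketch below compares two local directories. It assumes, purely for illustration, that each thumbnail shares its file stem with the XML document and uses a .png extension; the function name missing_thumbnails and the directory names in the usage line are hypothetical, and the naming scheme actually used by the project may differ.

```python
from pathlib import Path


def missing_thumbnails(xml_dir, thumb_dir, suffix=".png"):
    """List the stems of XML documents that have no matching thumbnail.

    Assumes (hypothetically) that "foo.xml" pairs with "foo.png".
    """
    xml_stems = {path.stem for path in Path(xml_dir).glob("*.xml")}
    thumb_stems = {path.stem for path in Path(thumb_dir).glob(f"*{suffix}")}
    return sorted(xml_stems - thumb_stems)


if __name__ == "__main__":
    # Hypothetical directory names; adjust them to your own layout.
    for stem in missing_thumbnails("crawled_data", "vignettes"):
        print(stem)
```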

Disclaimer

  • Should you run this spider, you are going to scrape A LOT of data. Use at your own risk!

  • The texts provided by OBVIL are copyrighted.

[1] To specify which corpora should be imported, you will need to customize a configuration file. See the "configs" directory of this repo.
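
As an illustration of how a configuration file could select corpora, here is a minimal sketch. The schema is entirely hypothetical: the keys "corpora", "name", "url" and "enabled", the example.org URLs, and the function load_corpora are assumptions, not the format of the files in the "configs" directory, which remain the reference.

```python
import json


def load_corpora(config_text):
    """Return (name, url) pairs for the enabled corpora in a JSON config.

    The schema used here is hypothetical, not the project's real format.
    """
    config = json.loads(config_text)
    selected = []
    for entry in config.get("corpora", []):
        # Entries are treated as enabled unless explicitly switched off.
        if not entry.get("enabled", True):
            continue
        missing = {"name", "url"} - entry.keys()
        if missing:
            raise ValueError(f"corpus entry lacks keys: {sorted(missing)}")
        selected.append((entry["name"], entry["url"]))
    return selected


# Made-up configuration, used only to exercise the function.
SAMPLE_CONFIG = """
{
  "corpora": [
    {"name": "corpus-a", "url": "https://example.org/corpus-a/"},
    {"name": "corpus-b", "url": "https://example.org/corpus-b/", "enabled": false}
  ]
}
"""

if __name__ == "__main__":
    for name, url in load_corpora(SAMPLE_CONFIG):
        print(name, url)
```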