OBVILCorpusImporter
This project is intended to ease the mass import of the OBVIL Library into the OBVIL OAI-PMH repository.
What this script does
Once launched with the proper command, for instance
python3 scrap_obvil_corpora.py -s "crawled_data" -c ../configs/config_omeka.json
this script crawls the specified OBVIL corpora (see footnote 1) available in the OBVIL Library.
It will:
- save the XML/TEI version of the texts in the specified directory (i.e. "crawled_data");
- extract the relevant header metadata to be exposed in the OAI-PMH repository (e.g. dc:creator, dc:relation, dc:rights, dc:format, dc:identifier, dc:title, dc:contributor...);
- create a thumbnail ("vignette") for each document. All the thumbnails have been generated once and are stored here. If some are missing, you may consider copying them over directly with scp, using your admin privileges;
- build one Omeka CSV import file per specified project, with all the necessary information, in the specified directory (i.e. "crawled_data").
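The metadata extraction and CSV steps above can be pictured with the following minimal sketch. It is not the code used by scrap_obvil_corpora.py: the function names extract_dc and rows_to_csv, the sample document, the XPath choices, and the "application/tei+xml" value for dc:format are all assumptions made for illustration. Only the standard TEI namespace and the dc:* field names come from real conventions and from the list above.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Standard TEI namespace; the paths used below are assumptions about
# where the relevant fields live in a typical TEI header.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, heavily simplified TEI document used only for this example.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Sample title</title>
        <author>Jane Doe</author>
      </titleStmt>
      <publicationStmt>
        <idno>http://example.org/sample</idno>
        <availability><licence>CC BY-NC-ND</licence></availability>
      </publicationStmt>
    </fileDesc>
  </teiHeader>
  <text><body><p>Some text.</p></body></text>
</TEI>
"""


def extract_dc(tei_xml):
    """Map a few TEI header fields to Dublin Core keys (hypothetical mapping)."""
    root = ET.fromstring(tei_xml)

    def text_of(path):
        # Return the concatenated text of the first match, or "" if absent.
        node = root.find(path, TEI_NS)
        return "".join(node.itertext()).strip() if node is not None else ""

    return {
        "dc:title": text_of(".//tei:titleStmt/tei:title"),
        "dc:creator": text_of(".//tei:titleStmt/tei:author"),
        "dc:identifier": text_of(".//tei:publicationStmt/tei:idno"),
        "dc:rights": text_of(".//tei:publicationStmt/tei:availability"),
        "dc:format": "application/tei+xml",  # assumed constant value
    }


def rows_to_csv(rows):
    """Serialize a non-empty list of metadata dicts to CSV text, one row per document."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


record = extract_dc(SAMPLE_TEI)
csv_text = rows_to_csv([record])
print(csv_text)
```

The column names that Omeka expects depend on the mappings chosen at import time, so the header row shown here is only indicative.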
Tl;dr:
python3 scrap_obvil_corpora.py -s "crawled_data" -c ../configs/config_omeka.json
- Everything you need ends up in the crawled_data folder.
What it does not do (i.e. DIY)
To successfully import the documents into the OAI-PMH repository, you will need to:
- Run this script with the right options and configuration.
- Put the generated vignettes in the right place on the server if they are missing.
- Manually import the generated CSV file into Omeka, with proper rights and mappings.
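Before copying vignettes to the server, it may help to know which ones are missing. The sketch below is one possible way to check, under assumptions that this README does not confirm: that each crawled text is saved as a .xml file, that its vignette shares the same base name, and that vignettes are PNG or JPEG files. The function name missing_vignettes and both directory arguments are hypothetical.

```python
from pathlib import Path


def missing_vignettes(xml_dir, vignette_dir, extensions=(".png", ".jpg", ".jpeg")):
    """Return the base names of XML files that have no same-named image.

    Assumes (hypothetically) that "foo.xml" pairs with "foo.png" or "foo.jpg".
    """
    xml_stems = {p.stem for p in Path(xml_dir).glob("*.xml")}
    thumb_stems = {
        p.stem
        for p in Path(vignette_dir).iterdir()
        if p.suffix.lower() in extensions
    }
    return sorted(xml_stems - thumb_stems)
```

For example, missing_vignettes("crawled_data", "vignettes") would return a sorted list of document names to handle by hand, where "vignettes" stands for wherever a local copy of the thumbnails is kept.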
Disclaimer
- Should you run these spiders, you are going to scrape A LOT of data. Use at your own risk!
- The texts provided by OBVIL are copyrighted.
1 To specify which corpora should be imported, you will need to customize a configuration file. See the "configs" directory of this repo.