Reconciliation of manuscript sale catalogue entries
Manuscript sale catalogues list manuscripts offered for sale. The same manuscript can be sold multiple times.
Our objective is to detect the entries that describe the same manuscript.
Catalogue entries look like the following:
<item n="80" xml:id="CAT_000146_e80">
<num>80</num>
<name type="author">Cherubini (L.),</name>
<trait>
<p>l'illustre compositeur</p>
</trait>
<desc>L. a s.; 1836, 1 p 1 /2 in8.</desc>
<measure commodity="currency" unit="FRF" quantity="12">12</measure>
</item>
Most of the reconciliation process uses data from the <desc> element of our XML files. We therefore need to correct typos to ease further processing, e.g.
- L. a s. -> L. a. s.
- in8 -> in-8
- 1 /2 -> 1/2
- 1 p -> 1 p.
The clean-xml.py script [available here] tackles this problem:
- python clean-xml.py -f FILENAME processes a single file
- python clean-xml.py -d DIRECTORY processes all the files contained in a directory
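The four corrections listed above can be pictured as a small set of regular-expression substitutions. The following is only a minimal sketch of that idea, not the code of clean-xml.py: the rule list and the function name normalize_desc are assumptions made for illustration, and the real script may handle many more cases.

```python
import re

# Hypothetical substitution rules, one per correction listed above.
# They are NOT taken from clean-xml.py.
RULES = [
    (re.compile(r"\bL\. a s\."), "L. a. s."),    # L. a s. -> L. a. s.
    (re.compile(r"\bin(\d+)\b"), r"in-\1"),      # in8     -> in-8
    (re.compile(r"(\d) /(\d)"), r"\1/\2"),       # 1 /2    -> 1/2
    (re.compile(r"(\d) p(?![.\w])"), r"\1 p."),  # 1 p     -> 1 p.
]


def normalize_desc(text: str) -> str:
    """Apply every rule in order and return the cleaned description."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    print(normalize_desc("L. a s.; 1836, 1 p 1 /2 in8."))
```

The rules are written so that applying them to an already cleaned description leaves it unchanged, which makes it safe to run the cleaning step more than once on the same file.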
We then need to extract data from the <desc> element and transform
<item n="80" xml:id="CAT_000146_e80">
<num>80</num>
<name type="author">Cherubini (L.),</name>
<trait>
<p>l'illustre compositeur</p>
</trait>
<desc>L. a. s.; 1836, 1 p. in-8.</desc>
<measure commodity="currency" unit="FRF" quantity="12">12</measure>
</item>
into
{
"CAT_000156_e14_d1": {
"desc": "L. a. s.; 1836, 1 p. in-8. 12",
"price": 12,
"author": "Cherubini",
"date": 1836,
"number_of_pages": 1,
"format": 8,
"term": 4,
"sell_date": "Mars 1893"
}
}
and
<item n="80" xml:id="CAT_000146_e80">
<num>80</num>
<name type="author">Cherubini (L.),</name>
<trait>
<p>l'illustre compositeur</p>
</trait>
<desc><term>L. a. s.</term>;<date>1836</date>,
<measure type="length" unit="p" n="1">1 p.</measure> <measure unit="f" type="format" n="8">in-8</measure>.
<measure commodity="currency" unit="FRF" quantity="12">12</measure></desc>
</item>
To carry out this task, we use the extractor.py script [available here] (the XML output is not fully implemented yet).
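To make the extraction step more concrete, here is a minimal sketch of how a few fields could be pulled out of a cleaned description with regular expressions. It is not the code of extractor.py: the function name extract_fields and the patterns are assumptions, the term is kept as a plain string (the JSON sample above stores a numeric code whose mapping is not documented here), and years are assumed to have four digits starting with 1.

```python
import re


def extract_fields(desc: str) -> dict:
    """Extract term, date, page count and format from a cleaned description.

    Illustrative only: a field is omitted when its pattern does not match.
    """
    fields = {}

    # Assumption: the term (e.g. "L. a. s.") is whatever precedes the first ";".
    if ";" in desc:
        fields["term"] = desc.split(";")[0].strip()

    # Assumption: the date is the first standalone four-digit year 1000-1999.
    year = re.search(r"\b(1\d{3})\b", desc)
    if year:
        fields["date"] = int(year.group(1))

    # Number of pages, written "N p." after cleaning.
    pages = re.search(r"(\d+) p\.", desc)
    if pages:
        fields["number_of_pages"] = int(pages.group(1))

    # Format, written "in-N" after cleaning.
    fmt = re.search(r"\bin-(\d+)", desc)
    if fmt:
        fields["format"] = int(fmt.group(1))

    return fields


if __name__ == "__main__":
    print(extract_fields("L. a. s.; 1836, 1 p. in-8."))
```

The key names date, number_of_pages and format follow the JSON sample above. The sketch relies on the description having been cleaned first, since it expects the normalised forms "1 p." and "in-8".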
The reconciliation is carried out by the reconciliator.py script [available here].
git clone https://github.com/katabase/reconciliation.git
cd reconciliation
python3 -m venv my_env
source my_env/bin/activate
pip3 install -r requirements.txt
Two main scripts are used:
- the first one, extractor.py, extracts the information from the XML files;
- the second one, reconciliator.py, reconciles the entries, i.e. identifies the entries that correspond to the same document. The user has to provide an author (using the -a flag) to filter the database. The user can also filter by date (using the -d flag).
The data are stored as JSON in folders named after the author and the date range. Four files are created:
- filtered_db.json is the result of the extraction, before the reconciliation of the entries;
- reconciliated_pairs.json provides a list of all the probably similar documents, ordered by probability;
- reconciliated_documents.json provides the list of the documents that have been reconciled;
- final_db.json contains all the entries with the reconciliation done.
We want to work on Mme de Sévigné.
- First, we create the database. In the script directory,
python3 extractor.py
- Then, we use the second script:
python3 reconciliator.py -a Sévigné
- The files will be stored in
output/json/Sevigne/
We want to select the production of Mme de Sévigné between 1680 and 1690:
- First, we create the database. In the script directory,
python3 extractor.py
- Then, we use the second script with the -d flag:
python3 reconciliator.py -a Sévigné -d 1680-1690
- The results will be stored in
output/json/Sevigne/1680-1690/
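The filtering performed by the -a and -d flags in the two examples above can be pictured as follows. This is a sketch under stated assumptions, not the code of reconciliator.py: the helper names (strip_accents, parse_range, filter_entries), the accent-insensitive author comparison, and the inclusive date bounds are all hypothetical, and the entry identifiers in the usage part are invented.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text without diacritics, e.g. 'Sévigné' -> 'Sevigne'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def parse_range(value: str) -> tuple:
    """Turn a string such as '1680-1690' into (1680, 1690)."""
    start, end = value.split("-")
    return int(start), int(end)


def filter_entries(db: dict, author: str, date_range=None) -> dict:
    """Keep the entries by `author`, optionally restricted to a date range.

    Assumptions: author names are compared without accents or case,
    and both bounds of the range are inclusive.
    """
    wanted = strip_accents(author).lower()
    selected = {}
    for entry_id, entry in db.items():
        name = entry.get("author") or ""
        if strip_accents(name).lower() != wanted:
            continue
        if date_range is not None:
            start, end = date_range
            date = entry.get("date")
            if date is None or not (start <= date <= end):
                continue
        selected[entry_id] = entry
    return selected


if __name__ == "__main__":
    # Invented identifiers and values, for illustration only.
    db = {
        "entry_a": {"author": "Sévigné", "date": 1685},
        "entry_b": {"author": "Sévigné", "date": 1671},
        "entry_c": {"author": "Cherubini", "date": 1836},
    }
    print(filter_entries(db, "Sévigné", parse_range("1680-1690")))
```

Entries without a date are dropped when a range is given; whether the actual script keeps or discards them is not stated here.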
If you use these data, please cite this paper:
@inproceedings{gabay:howmanyDH2020,
AUTHOR = {Gabay, Simon and Rondeau Du Noyer, Lucie and Gille Levenson, Matthias and Petkovic, Ljudmila and Bartz, Alexandre},
TITLE = {Quantifying the Unknown. How many manuscripts of the marquise de Sévigné still exist?},
SHORTTITLE = {Quantifying the Unknown},
ADDRESS = {Ottawa, Canada},
MONTH = {July},
YEAR = {2020},
BOOKTITLE = {DH2020: carrefours/intersections},
KEYWORDS = {Machine learning ; Manuscript sales catalogues ; 19th c. France; Mme de Sévigné},
}
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International Licence.