SEG17
This script handles the segmentation, normalization and lemmatization of XML-TEI encoded files.
Getting started
To install SEG17 from the command line:
- clone or download this repository
git clone git@github.com:e-ditiones/SEG17.git
cd SEG17
Segmentation
- create a first virtual environment and activate it
python3 -m venv env
source env/bin/activate
- install dependencies
pip install -r requirements.txt
- if you want to split your text
python3 scripts/segment_text.py path/to/file
- You will get filename_segmented.xml.
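As a quick sanity check on the segmented file, the sketch below counts the <seg> elements it contains. It is not part of SEG17: the helper names `local_name` and `count_segments` are hypothetical, and the sample string only imitates a TEI document using the standard TEI namespace.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def count_segments(root):
    """Count <seg> elements under `root`, whatever their namespace."""
    return sum(
        1
        for element in root.iter()
        if isinstance(element.tag, str) and local_name(element.tag) == "seg"
    )


# Hypothetical sample; a real file would be loaded with ET.parse(...).getroot()
sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    "<p><seg>Un.</seg><seg>Deux.</seg></p>"
    "</body></text></TEI>"
)
print(count_segments(ET.fromstring(sample)))
```

To run it on your own output, replace the sample with `ET.parse("path/to/filename_segmented.xml").getroot()`. A count of zero suggests the segmentation step did not run as expected.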
Lemmatisation
- The virtual environment to use is env.
- install lemmatisation models
PIE_EXTENDED_DOWNLOADS=~/MesModelsPieExtended pie-extended download fr
- if you want to lemmatize your segmented file
PIE_EXTENDED_DOWNLOADS=~/MesModelsPieExtended python3 scripts/lemmatize path/to/file_segmented.xml
- You will find the results of the lemmatisation in output/data.csv.
Normalisation LSTM
- First, deactivate the previous virtual env:
deactivate
- create a second virtual environment and activate it
python3 -m venv norm_lstm
source norm_lstm/bin/activate
- install dependencies
pip install -r NORM17-LSTM/requirements.txt
- download the model
cd NORM17-LSTM
bash download_model.sh
- if you want to normalize your segmented file
cd ..
python3 scripts/normlize_lstm.py path/to/file_segmented
- The file output/data.csv will be updated to contain the result of the normalisation.
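To see what the lemmatisation and normalisation steps wrote, a small sketch like the one below can list the columns and count the rows of output/data.csv. The column layout is not documented here, so the code makes no assumption about column names. It does assume a comma delimiter and a header row, which may not match the real file, and `summarize_csv` is a hypothetical helper, not part of SEG17.

```python
import csv
from pathlib import Path


def summarize_csv(path, delimiter=","):
    """Return (column_names, row_count) for a delimited text file.

    Assumes the first row is a header; the delimiter is a guess and can be
    overridden (for example with "\t" for tab-separated data).
    """
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader, [])
        row_count = sum(1 for _ in reader)
    return header, row_count


if __name__ == "__main__":
    csv_path = Path("output/data.csv")
    if csv_path.exists():
        columns, rows = summarize_csv(csv_path)
        print(f"columns: {columns}")
        print(f"rows: {rows}")
    else:
        print(f"{csv_path} not found; run the earlier steps first")
```

Running it once after lemmatisation and once after normalisation shows which columns the second step added.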
NER
Get an XML file
Using the generated CSV file, csv_to_xml.py builds an XML file.
- First, deactivate the previous virtual env:
deactivate
- Then, activate the first virtual env
source env/bin/activate
- Get the annotated XML file
python3 scripts/csv_to_xml.py path/to/file_segmented
- You will get file_annotated.xml.
How it works
The segmentation
Using the Level-2_to_level-3.xsl XSL stylesheet, the script adds XML-TEI tags to split the text into segments (<seg>).
For each <p> (paragraph) and <l> (line), the script level2to3.py uses punctuation marks (.;:!?) to split the text into segments wrapped in <seg> elements.
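The idea can be illustrated with a minimal Python sketch. This is not the code of level2to3.py or of the XSL stylesheet: the function names are hypothetical, and the sketch only handles elements that contain plain text. It splits after each of the punctuation marks listed above and wraps every piece in a <seg> element.

```python
import re
import xml.etree.ElementTree as ET

# Split on the whitespace that follows one of . ; : ! ?
# The lookbehind keeps the punctuation mark attached to the segment it closes.
SEGMENT_BOUNDARY = re.compile(r"(?<=[.;:!?])\s+")


def split_into_segments(text):
    """Return the non-empty segments of `text`, split after . ; : ! or ?"""
    parts = SEGMENT_BOUNDARY.split(text.strip())
    return [part for part in parts if part]


def wrap_in_seg(element):
    """Replace the plain text of a <p> or <l> element with <seg> children.

    Simplification: existing child elements and their tails are not
    handled, so this only suits elements that contain text and nothing else.
    """
    segments = split_into_segments(element.text or "")
    element.text = None
    for segment in segments:
        seg = ET.SubElement(element, "seg")
        seg.text = segment
    return element


paragraph = ET.fromstring("<p>Je pars demain; viendrez-vous? Non.</p>")
wrap_in_seg(paragraph)
print(ET.tostring(paragraph, encoding="unicode"))
# <p><seg>Je pars demain;</seg><seg>viendrez-vous?</seg><seg>Non.</seg></p>
```

Real TEI paragraphs usually contain nested elements and a namespace, which is one reason a stylesheet-based approach is a better fit than this text-only sketch.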
The lemmatisation
For lemmatisation, we use Pie-extended and the "fr" model.
The original version, and not the normalised version, is lemmatised.
Credits (TO BE CHANGED)
This repository is developed by Alexandre Bartz with the help of Simon Gabay, as part of the project e-ditiones.
Licences
Our work is licenced under a Creative Commons Attribution 4.0 International Licence.
Pie-extended is under the Mozilla Public License 2.0.
Morphalou is under the LGPL-LR.
Cite this repository (TO BE CHANGED)
Alexandre Bartz, Simon Gabay. 2020. Lemmatization and normalization of French modern manuscripts and printed documents. Retrieved from https://github.com/e-ditiones/SEG17.