SEG17

This script performs segmentation, normalisation and lemmatisation of XML-TEI encoded files.

Getting started

To install SEG17 from the command line:

clone or download this repository

git clone git@github.com:e-ditiones/SEG17.git
cd SEG17

Segmentation

create a first virtual environment and activate it

python3 -m venv env
source env/bin/activate

install dependencies

pip install -r requirements.txt

if you want to split your text

python3 scripts/segment_text.py path/to/file

You will get filename_segmented.xml.
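
To check that the segmentation produced something, you can count the <seg> elements in the output. The following is a minimal sketch, not part of SEG17: the function names count_segments and local_name are hypothetical, and it assumes only that the output is well-formed XML (the TEI namespace in the sample is an assumption about your files).

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a namespace prefix such as '{http://www.tei-c.org/ns/1.0}' from a tag."""
    return tag.rsplit("}", 1)[-1]


def count_segments(xml_source):
    """Count <seg> elements in a file path or file object, whatever their namespace."""
    root = ET.parse(xml_source).getroot()
    return sum(
        1
        for element in root.iter()
        if isinstance(element.tag, str) and local_name(element.tag) == "seg"
    )


# In-memory sample standing in for a real filename_segmented.xml (assumed shape).
sample = io.BytesIO(
    b'<TEI xmlns="http://www.tei-c.org/ns/1.0">'
    b"<p><seg>A.</seg><seg>B!</seg></p>"
    b"</TEI>"
)
print(count_segments(sample))  # 2
```

With a real file, pass its path instead of the in-memory sample, for example count_segments("path/to/file_segmented.xml").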

Lemmatisation

Use the virtual environment env created above.

install lemmatisation models

PIE_EXTENDED_DOWNLOADS=~/MesModelsPieExtended pie-extended download fr

if you want to lemmatize your segmented file

PIE_EXTENDED_DOWNLOADS=~/MesModelsPieExtended python3 scripts/lemmatize path/to/file_segmented.xml

In output/data.csv, you will find the results of the lemmatisation.
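
To look at the first rows of the results without opening a spreadsheet, a short helper such as the one below may be enough. This is only a sketch: preview_csv is a hypothetical name, and the comma delimiter and UTF-8 encoding are assumptions about output/data.csv, so adjust them if your file differs.

```python
import csv
from itertools import islice


def preview_csv(path, limit=5, delimiter=","):
    """Return the header row and up to `limit` data rows of a CSV file.

    The delimiter and the UTF-8 encoding are assumptions, not facts about SEG17.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader, [])          # empty list if the file is empty
        rows = list(islice(reader, limit))  # at most `limit` rows after the header
    return header, rows
```

Calling preview_csv("output/data.csv") returns a (header, rows) pair that you can print or inspect.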

Normalisation LSTM

First, deactivate the previous virtual environment:

deactivate

create a second virtual environment and activate it

python3 -m venv norm_lstm
source norm_lstm/bin/activate

install dependencies

pip install -r NORM17-LSTM/requirements.txt

download the model

cd NORM17-LSTM
bash download_model.sh

if you want to normalize your segmented file

cd ..
python3 scripts/normlize_lstm.py path/to/file_segmented

The file output/data.csv will be updated with the results of the normalisation.

NER

Get an XML file

The script csv_to_xml.py builds an XML file from the generated CSV file.

First, deactivate the previous virtual environment:

deactivate

Then, activate the first virtual environment:

source env/bin/activate

Get the annotated XML file

python3 scripts/csv_to_xml.py path/to/file_segmented

You will get file_annotated.xml.

How it works

The segmentation

Using the Level-2_to_level-3.xsl XSL stylesheet, the script adds XML-TEI tags that split the text into segments (<seg>). For each <p> (paragraph) and <l> (line), the script level2to3.py splits the text at certain punctuation marks (.;:!?) and wraps each resulting segment in a <seg> element.
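
The punctuation-based splitting described above can be illustrated in plain Python. This is a simplified sketch, not the code of level2to3.py or of the XSL stylesheet: the names split_segments and wrap_in_seg are hypothetical, the sample sentence is invented, and the sketch assumes the element holds plain text only (child elements and namespaces are ignored).

```python
import re
import xml.etree.ElementTree as ET

# Split at whitespace that follows one of the marks . ; : ! ?
# so that each mark stays attached to the segment it closes.
_BOUNDARY = re.compile(r"(?<=[.;:!?])\s+")


def split_segments(text):
    """Return the non-empty segments of `text`, split after . ; : ! ?"""
    return [part.strip() for part in _BOUNDARY.split(text) if part.strip()]


def wrap_in_seg(element):
    """Replace the plain text of a <p> or <l> element with <seg> children."""
    segments = split_segments(element.text or "")
    element.text = None
    for segment in segments:
        seg = ET.SubElement(element, "seg")
        seg.text = segment
    return element


paragraph = ET.fromstring("<p>Je pars; tu restes. Pourquoi? Adieu!</p>")
print(ET.tostring(wrap_in_seg(paragraph), encoding="unicode"))
# <p><seg>Je pars;</seg><seg>tu restes.</seg><seg>Pourquoi?</seg><seg>Adieu!</seg></p>
```

Splitting only at whitespace after a mark keeps abbreviations without a following space intact, but it will still cut after any full stop followed by a space, so real texts may need extra rules.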

The lemmatisation

For lemmatisation, we use Pie-extended and the "fr" model.

The original version, and not the normalised version, is lemmatised.

Credits (to be updated)

This repository is developed by Alexandre Bartz with the help of Simon Gabay, as part of the e-ditiones project.

Licences

Our work is licensed under a Creative Commons Attribution 4.0 International Licence.

Pie-extended is under the Mozilla Public License 2.0.

Morphalou is under the LGPL-LR.

Cite this repository (to be updated)

Alexandre Bartz, Simon Gabay. 2020. Lemmatization and normalization of French modern manuscripts and printed documents. Retrieved from https://github.com/e-ditiones/SEG17.
