SEG17

This set of scripts segments, normalises and lemmatises XML-TEI encoded files.

Getting started

To install SEG17 from the command line:

  • clone or download this repository
git clone git@github.com:e-ditiones/SEG17.git
cd SEG17

Segmentation

  1. Create a first virtual environment and activate it:
python3 -m venv env
source env/bin/activate
  2. Install the dependencies:
pip install -r requirements.txt
  3. Segment your text:
python3 scripts/segment_text.py path/to/file
  4. You will get filename_segmented.xml.
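To sanity-check the segmented file, one option is to count its <seg> elements. The following is a minimal sketch, not part of SEG17: the helper name count_segments and the sample content are hypothetical, and since it is an assumption that the output uses the standard TEI namespace, the helper also counts un-namespaced elements.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # standard TEI namespace


def count_segments(root):
    """Count <seg> elements, with or without the TEI namespace."""
    namespaced = root.findall(f".//{{{TEI_NS}}}seg")
    plain = root.findall(".//seg")
    return len(namespaced) + len(plain)


# Tiny in-memory stand-in for a segmented file (hypothetical content).
sample = ET.fromstring(
    f'<TEI xmlns="{TEI_NS}"><text><body>'
    "<p><seg>Premier segment.</seg><seg>Second segment.</seg></p>"
    "</body></text></TEI>"
)
print(count_segments(sample))  # → 2
```

For a real file, replace the sample with ET.parse("path/to/filename_segmented.xml").getroot().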

Lemmatisation

  1. Use the first virtual environment (env).

  2. Install the lemmatisation models:

PIE_EXTENDED_DOWNLOADS=~/MesModelsPieExtended pie-extended download fr
  3. Lemmatise your segmented file:
PIE_EXTENDED_DOWNLOADS=~/MesModelsPieExtended python3 scripts/lemmatize path/to/file_segmented.xml
  4. The results of the lemmatisation are written to output/data.csv.

Normalisation LSTM

  1. Deactivate the previous virtual environment:
deactivate
  2. Create a second virtual environment and activate it:
python3 -m venv norm_lstm
source norm_lstm/bin/activate
  3. Install the dependencies:
pip install -r NORM17-LSTM/requirements.txt
  4. Download the model:
cd NORM17-LSTM
bash download_model.sh
  5. Normalise your segmented file:
cd ..
python3 scripts/normlize_lstm.py path/to/file_segmented
  6. The file output/data.csv is updated with the result of the normalisation.

NER

Get an XML file

The script csv_to_xml.py builds an XML file from the CSV file created in the previous steps.

  1. Deactivate the previous virtual environment:
deactivate
  2. Activate the first virtual environment again:
source env/bin/activate
  3. Generate the annotated XML file:
python3 scripts/csv_to_xml.py path/to/file_segmented
  4. You will get file_annotated.xml.

How it works

The segmentation

Using the Level-2_to_level-3.xsl XSL stylesheet, the script adds XML-TEI tags to split the text into segments (<seg>). For each <p> (paragraph) and <l> (line), the script level2to3.py splits the text at certain punctuation marks (.;:!?) and wraps each resulting segment in a <seg> element.

The lemmatisation
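The splitting rule described above can be illustrated in plain Python. This is a simplified sketch, not the XSL stylesheet or level2to3.py themselves: the function name segment_element is hypothetical, and it assumes that the element contains plain text only (no child elements) and that a segment ends wherever one of the marks .;:!? is followed by whitespace.

```python
import re
import xml.etree.ElementTree as ET

# Split at whitespace that follows one of the punctuation marks . ; : ! ?
# so that each mark stays attached to the segment it closes.
SEGMENT_BOUNDARY = re.compile(r"(?<=[.;:!?])\s+")


def segment_element(element):
    """Replace the plain text of a <p> or <l> element with <seg> children."""
    text = (element.text or "").strip()
    element.text = None
    for piece in SEGMENT_BOUNDARY.split(text):
        if piece:
            seg = ET.SubElement(element, "seg")
            seg.text = piece
    return element


paragraph = ET.fromstring("<p>Qui va là ? C'est moi; ouvrez la porte.</p>")
segmented = ET.tostring(segment_element(paragraph), encoding="unicode")
print(segmented)
```

A rule this simple also splits after abbreviations ending in a full stop, so real texts may need additional handling.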

For lemmatisation, we use Pie-extended and the "fr" model.

The original version, and not the normalised version, is lemmatised.

Credits (to be changed)

This repository is developed by Alexandre Bartz with the help of Simon Gabay, as part of the e-ditiones project.

Licences

Our work is licensed under a Creative Commons Attribution 4.0 International Licence.

Pie-extended is under the Mozilla Public License 2.0.

Morphalou is under the LGPL-LR.

Cite this repository (to be changed)

Alexandre Bartz, Simon Gabay. 2020. Lemmatization and normalization of French modern manuscripts and printed documents. Retrieved from https://github.com/e-ditiones/SEG17.