SEM (Segmenteur-Étiqueteur Markovien) is a free NLP tool built on machine learning, in particular conditional random fields (CRFs). It provides configurable preprocessing and postprocessing, and is also available as an online version.
- A GUI for manual annotation (requires Tkinter)
- from terminal: run
python -m sem annotation_gui
- fast annotation: keyboard shortcuts and document-wide annotation broadcast
- can load pre-annotated files
- support for hierarchical tags (dot-separated, e.g. "noun.common")
- handles multiple input formats
- export in different formats
- A GUI for easier use (requires Tkinter)
- on Linux: double-click on sem_gui.sh
- on Windows: double-click on sem_gui.bat
- from terminal: run
python -m sem gui
- segmentation
- segmentation for: French, English
- easy creation and integration of new tokenisers
- feature generation
- features declared in an XML file, without having to code them
- single-token and multi-token dictionary features
- regular expression features
- sequenced features
- train/label mode
- display option for features that are useful for generation, but not needed in output
- exporting output
- supported export formats: CoNLL, text, HTML (from plain text), and two XML-TEI variants (one specific to NP chunks, the other for everything else)
- easy creation and integration of new exporters
- extension of existing features
- automatic integration of new segmenters and exporters
- semi-automatic integration of new feature functions
- easy creation of new CSS formats for HTML exports
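
The annotation GUI's hierarchical tags are dot-separated (e.g. "noun.common"). As a minimal sketch of what that naming scheme implies, the snippet below expands such a tag into its levels, from most general to most specific. The function name `tag_levels` is hypothetical and only illustrates the convention; it is not SEM's own code.

```python
def tag_levels(tag, separator="."):
    """Return every prefix of a separator-delimited tag, most general first.

    Hypothetical helper for illustration only, not part of SEM.
    """
    parts = tag.split(separator)
    return [separator.join(parts[: i + 1]) for i in range(len(parts))]


print(tag_levels("noun.common"))  # ['noun', 'noun.common']
```

A tag without a separator, such as "noun", simply yields a single level.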
- install SEM
- see install.md
- This compiles Wapiti and creates the necessary directories. SEM data is currently stored in
~/sem_data
- run tests
- run
python -m sem --test
in a terminal
- run SEM
- run GUI (see "main features" above) and annotate "non-regression/fr/in/segmentation.txt"
- or run:
python -m sem tagger resources/master/fr/NER.xml ./non-regression/fr/in/segmentation.txt -o sem_output
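
For scripted use, the tagger command above can be wrapped in Python. This is a minimal sketch that only reuses the arguments shown in that command (`tagger`, a master file, an input file and `-o`). The function names `build_tagger_command` and `run_tagger` are hypothetical helpers, and using `sys.executable` in place of `python` is an assumption made so that the same interpreter is reused.

```python
import subprocess
import sys


def build_tagger_command(master_file, input_file, output):
    """Build the argument list for the tagger command shown above.

    Hypothetical helper: it only mirrors the documented command line.
    """
    return [
        sys.executable, "-m", "sem", "tagger",
        master_file, input_file,
        "-o", output,
    ]


def run_tagger(master_file, input_file, output):
    """Run the tagger; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_tagger_command(master_file, input_file, output), check=True)


# Example call (requires SEM to be installed):
# run_tagger(
#     "resources/master/fr/NER.xml",
#     "./non-regression/fr/in/segmentation.txt",
#     "sem_output",
# )
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.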
- French Treebank by Abeillé et al. (2003): corpus used for POS and chunking.
- NER annotated French Treebank by Sagot et al. (2012): corpus used for NER.
- Lexique des Formes Fléchies du Français (LeFFF) by Clément et al. (2004): French lexicon of inflected forms with additional information such as POS tags and lemmas.
- Wapiti by Lavergne et al. (2010): linear-chain CRF library.
- setuptools: to install SEM.
- Tkinter: for GUI modules (they will not be installed if Tkinter is not present).
- Windows only: MinGW64: used to compile Wapiti on Windows.
- Windows only: POSIX threads for Windows: if you want to multithread Wapiti on Windows.
- GUI-specific: Tkinter: if you want to launch SEM's GUI.
- add a tutorial (part of it is already covered by the "retrain SEM" section of the manual)
- add a lemmatiser
- add more unit tests
- improve segmentation
- handle URLs starting with a country indicator (e.g. "en.wikipedia.org")
- handle URLs starting with a subdomain (e.g. "blog.[...]")
- DUPONT, Yoann and PLANCQ, Clément. Un étiqueteur en ligne du Français. TALN-RECITAL demonstration session, 2017, p. 15.
- Online interface
- (best RECITAL paper award) DUPONT, Yoann. Exploration de traits pour la reconnaissance d’entités nommées du Français par apprentissage automatique. RECITAL, 2017, p. 42.
- Named Entity Recognition (new, please use this one)
- TELLIER, Isabelle, DUCHIER, Denys, ESHKOL, Iris, et al. Apprentissage automatique d'un chunker pour le français. In: TALN2012, 2012, p. 431–438.
- Chunking
- TELLIER, Isabelle, DUPONT, Yoann, and COURMET, Arnaud. Un segmenteur-étiqueteur et un chunker pour le français. JEP-TALN-RECITAL 2012.
- Part-Of-Speech Tagging
- Chunking
- DUPONT, Yoann and TELLIER, Isabelle. Un reconnaisseur d’entités nommées du Français. TALN demonstration session, 2014, p. 40.
- Named Entity Recognition (old, please do not use)
@inproceedings{dupont2017etiqueteur,
title={Un {\'e}tiqueteur en ligne du fran{\c{c}}ais},
author={Dupont, Yoann and Plancq, Cl{\'e}ment},
booktitle={24e Conf{\'e}rence sur le Traitement Automatique des Langues Naturelles (TALN)},
pages={15--16},
year={2017}
}
@inproceedings{dupont2018exploration,
title={Exploration de traits pour la reconnaissance d’entit{\'e}s nomm{\'e}es du Fran{\c{c}}ais par apprentissage automatique},
author={Dupont, Yoann},
booktitle={24e Conf{\'e}rence sur le Traitement Automatique des Langues Naturelles (TALN)},
pages={42},
year={2017}
}
@inproceedings{tellier2012apprentissage,
title={Apprentissage automatique d'un chunker pour le fran{\c{c}}ais},
author={Tellier, Isabelle and Duchier, Denys and Eshkol, Iris and Courmet, Arnaud and Martinet, Mathieu},
booktitle={TALN2012},
volume={2},
pages={431--438},
year={2012}
}
@inproceedings{tellier2012segmenteur,
title={Un segmenteur-{\'e}tiqueteur et un chunker pour le fran{\c{c}}ais (A Segmenter-POS Labeller and a Chunker for French)[in French]},
author={Tellier, Isabelle and Dupont, Yoann and Courmet, Arnaud},
booktitle={Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 5: Software Demonstrations},
pages={7--8},
year={2012}
}
@article{dupont2014reconnaisseur,
title={Un reconnaisseur d’entit{\'e}s nomm{\'e}es du Fran{\c{c}}ais (A Named Entity recognizer for French)[in French]},
author={Dupont, Yoann and Tellier, Isabelle},
journal={Proceedings of TALN 2014 (Volume 3: System Demonstrations)},
volume={3},
pages={40--41},
year={2014}
}