Free French Treebank

Last release: 130612

Description

The development of this resource is part of a larger project that aims to build a free French treebank for training statistical systems on common NLP tasks such as text segmentation, morphological analysis, chunking, and parsing.

Licence

The current resource (i.e. the annotations) is distributed under the terms of the Lesser General Public License for Linguistic Resources [1]. This means you can use it in any context, modify it, and redistribute it, as long as you distribute your contributions under the same terms. If you use this project to support academic research, please cite the following paper as appropriate [2].

Notes on the current release

The current version is based on the frwikinews-20130110 articles dump [3]. It contains 28,000 news articles covering a period from January 2005 to the date of the dump. After filtering out sentences with fewer than 5 tokens, this version has 87,461 sentences and 2,535,396 tokens. The texts are available under the Creative Commons Attribution 2.5 (CC-BY 2.5) licence. Article versions prior to September 2005 are in the public domain.

The text was cleaned up using a Java Wikipedia API [4], then tokenized using a rule- and dictionary-based tokenizer [5], then POS tagged with the Stanford tagger [6]. The models were built on the human-verified corpus [7]. In the associated paper, we show that automatically annotated data can be used to train a POS tagger whose performance is similar to that of the original one (no statistically significant difference), provided the automatically annotated corpus is large enough.
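The sentence-length filter mentioned above can be illustrated with a minimal Python sketch. It is not the code used to build the release: the function name, the sample sentences, and the assumption that tokens are separated by whitespace (as in the txt-tok files) are ours.

```python
def filter_short_sentences(lines, min_tokens=5):
    """Keep only sentences with at least `min_tokens` whitespace-separated tokens.

    Assumes one tokenized sentence per line; `min_tokens=5` mirrors the
    "fewer than 5 tokens" threshold stated in the release notes.
    """
    kept = []
    for line in lines:
        tokens = line.split()
        if len(tokens) >= min_tokens:
            kept.append(line.strip())
    return kept


# Hypothetical sample sentences, already tokenized.
sample = [
    "Bonjour .",
    "Le chat dort sur le canapé .",
    "Il pleut .",
]
filtered = filter_short_sentences(sample)
print(filtered)
```

With this threshold, a sentence of exactly 5 tokens is kept and a sentence of 4 tokens is dropped.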

The resource contains

  • xml-bz2: the XML source archive
  • txt: the raw text
  • txt-tok: one sentence per line, with whitespace-separated tokens
  • txt-tok-pos: same as txt-tok, with each word followed by an underscore and its POS tag
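As an illustration of the txt-tok-pos format, here is a minimal Python sketch that splits one line into (word, tag) pairs. It assumes the tag is whatever follows the last underscore of each item, so that words containing underscores are handled; the function name and the tag labels in the sample line are illustrative and not taken from the resource.

```python
def parse_tagged_line(line):
    """Split a txt-tok-pos line into a list of (word, tag) pairs.

    Assumes whitespace-separated items of the form word_TAG, where the
    tag is the part after the LAST underscore.
    """
    pairs = []
    for item in line.split():
        word, sep, tag = item.rpartition("_")
        if not sep:
            # No underscore found: keep the item as a word without a tag.
            word, tag = item, None
        pairs.append((word, tag))
    return pairs


# Hypothetical sample line; the tag names are placeholders.
sample_line = "Le_DET chat_NC dort_V ._PONCT"
print(parse_tagged_line(sample_line))
```

To process a whole file, the same function can be applied to each line of a file opened in text mode; the file encoding is an assumption to check against the downloaded archive.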

Feel free to contact us at nicolas.hernandez @ univ-nantes.fr or to contribute.

Download

Free French Treebank [0]

References