Tabular2Lexicog

A tool for converting tabular data to the Ontolex-Lemon Lexicography module (Lexicog)

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

Converting tabular data to Ontolex-Lexicog

Ontolex-Lemon and Lexicog

Lexicography is the science of words and their semantic relationships. Wouldn't it be beneficial to take advantage of linked data for lexicography too? Well, this is the motivation behind Ontolex-Lemon.

Lemon stands for the lexicon model for ontologies, which provides rich linguistic grounding for ontologies. By "rich", the creators mean that the model is intended, in principle, to cover all types of information related to words in dictionaries, such as morphological and syntactic properties. Ontolex-Lemon is the result of the work of the W3C Ontology-Lexica Community Group.

One of the useful modules in Ontolex-Lemon is the lexicography module (lexicog).

Conversion to Ontolex-Lexicog

This tool takes lexicographic data in a tabular format as input, such as comma-separated values (CSV) or tab-separated values (TSV). In the current version of the tool, the following can be converted:

  • headwords
  • part-of-speech tags
  • senses
  • examples
  • idioms
  • see-also cross-references.

The conversion is configured through a file called configuration.json. In this file, you can set information such as the source and target languages with their codes, and the mapping of PoS tags to the Lexinfo module.
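
As a rough illustration of the kind of settings described above, the sketch below builds such a configuration in Python and serializes it as JSON. The key names (`source_language`, `target_language`, `pos_mapping`) and the PoS mappings are assumptions made for this example only; the keys the tool actually expects are the ones in the configuration.json shipped with the repository.

```python
import json

# Hypothetical configuration: the key names are illustrative,
# not the tool's actual schema.
config = {
    "source_language": {"name": "Kurmanji", "code": "kmr"},
    "target_language": {"name": "English", "code": "en"},
    # Map the dictionary's own PoS abbreviations to LexInfo part-of-speech names.
    "pos_mapping": {
        "m": "noun",
        "f": "noun",
        "adj": "adjective",
        "v.t.": "verb",
        "excl": "interjection",
    },
}

# Serialize to the text that would be stored in a JSON configuration file.
config_text = json.dumps(config, ensure_ascii=False, indent=2)
print(config_text)
```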

To run the code, clone or download this repository, then pass the input file and the configuration file on the command line with the -input and -config arguments, respectively:

python -input Sample_dictionary.tsv -config configuration.json 
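
For readers curious how two such arguments can be read in Python, here is a minimal sketch using the standard `argparse` module. It is not the tool's own code: the function name `build_parser` and the help texts are assumptions, and only the `-input` and `-config` option names come from the command above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the two options named in the command above."""
    parser = argparse.ArgumentParser(
        description="Convert tabular data to Ontolex-Lexicog (illustrative sketch)"
    )
    # argparse accepts single-dash long options such as -input and -config.
    parser.add_argument("-input", required=True, help="path to the CSV/TSV file")
    parser.add_argument("-config", required=True, help="path to configuration.json")
    return parser


# Parse an explicit argument list here; a real script would call parse_args()
# with no arguments so that the command line is used instead.
args = build_parser().parse_args(
    ["-input", "Sample_dictionary.tsv", "-config", "configuration.json"]
)
print(args.input, args.config)
```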

Please note that, for the moment, this script can only handle relatively simple structures.

A working example

Below are a few entries from a Kurdish dictionary in tabular format (the original data is in TSV):

Headword  | POS  | Sense (translation)                 | Example | Expression                    | Cf.
aferîde   | m    | creature                            |         |                               |
aferîn    | excl | bravo                               |         | bravo ji ... re: good for ... |
afirandin | v.t. | to create                           |         |                               |
afîş      | f    | poster                              |         |                               |
aga       | adj  | aware                               |         |                               |
agadarî   | f    | information announcement; awareness |         |                               |
agah      |      |                                     |         |                               | aga
agahdarî  |      |                                     |         |                               | agadarî
agir      | m    | fire                                |         | agir danîn bi: to set fire to |
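
Rows like these can be read with Python's standard `csv` module. The sketch below is only an illustration, not the tool's own reader: the helper `read_rows` is hypothetical, and the placement of cells in the two sample rows (in particular the cross-reference in the Cf. column) is an assumption about how the TSV is laid out.

```python
import csv
import io

# Two rows in the assumed TSV layout; empty cells are kept as empty fields.
sample_tsv = (
    "Headword\tPOS\tSense (translation)\tExample\tExpression\tCf.\n"
    "aga\tadj\taware\t\t\t\n"
    "agah\t\t\t\t\taga\n"
)


def read_rows(handle):
    """Yield one dict per row, keeping only the non-empty cells."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        yield {
            key: value.strip()
            for key, value in row.items()
            if isinstance(value, str) and value.strip()
        }


rows = list(read_rows(io.StringIO(sample_tsv)))
for row in rows:
    print(row)
```

With a real file, `io.StringIO(sample_tsv)` would be replaced by `open(path, encoding="utf-8", newline="")`.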

In order to carry out the conversion correctly, we set a few conventions:

  • Senses are separated using ; or ,.
  • Any part-of-speech tag can be used, as long as the correct mappings are provided in the configuration file. This concerns the Word, MultiwordExpression and Affix classes in Ontolex-Lemon.
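
The two conventions above can be sketched as follows. This is a minimal illustration, not the tool's implementation: `split_senses`, `map_pos` and the contents of `POS_MAPPING` are hypothetical, and pairing each tag with both an Ontolex-Lemon class and a LexInfo part of speech is an assumption about what the mappings contain.

```python
import re

# Hypothetical mapping: PoS tag -> (Ontolex-Lemon class, LexInfo part of speech).
POS_MAPPING = {
    "m": ("Word", "noun"),
    "f": ("Word", "noun"),
    "adj": ("Word", "adjective"),
    "v.t.": ("Word", "verb"),
    "excl": ("Word", "interjection"),
}


def split_senses(cell):
    """Split a sense cell on ';' or ',' and drop empty pieces."""
    return [part.strip() for part in re.split(r"[;,]", cell) if part.strip()]


def map_pos(tag):
    """Look up a PoS tag, failing loudly when no mapping is configured."""
    if tag not in POS_MAPPING:
        raise KeyError(f"No mapping configured for PoS tag {tag!r}")
    return POS_MAPPING[tag]


print(split_senses("announcement; awareness"))
print(map_pos("v.t."))
```

One consequence of this convention is that a single sense containing a comma would be split in two, so such senses need to be rephrased in the input.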

The result is written to a folder named after the source language, for example Kurmanji.
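
To give an idea of what a converted entry may look like, the sketch below serializes one headword as Turtle and writes it into a folder named after the source language. It is not the tool's actual output: the base IRI, the file name `lexicon.ttl`, the helper names and the use of `skos:definition` for the translated sense are all assumptions made for this example.

```python
import tempfile
from pathlib import Path
from urllib.parse import quote

PREFIXES = """@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix lexicog: <http://www.w3.org/ns/lemon/lexicog#> .
@prefix lexinfo: <http://www.lexinfo.net/ontology/2.0/lexinfo#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
"""
BASE = "http://example.org/dictionary/"  # hypothetical base IRI


def literal(text, lang):
    """Format a language-tagged Turtle string literal."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def entry_to_turtle(headword, pos, senses, source_lang="kmr", target_lang="en"):
    """Serialize one headword as a lexicog:Entry describing an ontolex:Word."""
    key = quote(headword)  # percent-encode non-ASCII characters for the IRI
    entry, word = f"<{BASE}{key}_entry>", f"<{BASE}{key}>"
    sense_iris = [f"<{BASE}{key}_sense{i}>" for i in range(1, len(senses) + 1)]
    props = [
        "a ontolex:Word",
        f"lexinfo:partOfSpeech lexinfo:{pos}",
        f"ontolex:canonicalForm [ ontolex:writtenRep {literal(headword, source_lang)} ]",
    ]
    if sense_iris:
        props.append("ontolex:sense " + ", ".join(sense_iris))
    blocks = [
        f"{entry} a lexicog:Entry ;\n    lexicog:describes {word} .",
        f"{word} " + " ;\n    ".join(props) + " .",
    ]
    for iri, sense in zip(sense_iris, senses):
        blocks.append(
            f"{iri} a ontolex:LexicalSense ;\n"
            f"    skos:definition {literal(sense, target_lang)} ."
        )
    return PREFIXES + "\n" + "\n\n".join(blocks) + "\n"


def write_lexicon(output_root, source_language, turtle):
    """Write the Turtle text into a folder named after the source language."""
    folder = Path(output_root) / source_language
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / "lexicon.ttl"  # hypothetical file name
    path.write_text(turtle, encoding="utf-8")
    return path


turtle = entry_to_turtle("agir", "noun", ["fire"])
print(turtle)

# A temporary directory keeps the example from touching the working directory.
with tempfile.TemporaryDirectory() as root:
    path = write_lexicon(root, "Kurmanji", turtle)
    relative = path.relative_to(root)
    written = path.read_text(encoding="utf-8")
```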