/ixa-pipe-domainterms

Primary LanguagePerlGNU General Public License v3.0GPL-3.0

ixa-pipe-domainterms

ixa-pipe-domainterms is a domain terms tagger. Given a dictionary for a specific domain, the ixa-pipe-domainters module takes a NAF document containing wf and term elements as input, recognizes terms on the given dictionary, and outputs a NAF document with those terms tagged on markables element (<markables source="ixa-pipe-domainterms">)

ixa-pipe-domainterms is part of IXA pipes, a multilingual NLP pipeline developed by the IXA NLP Group.

USAGE

The ixa-pipe-domainterms requires a NAF document (with wf and term elements) as standard input and outputs NAF through standard output. You can get the necessary input for ixa-pipe-domainterms by piping ixa-pipe-tok and ixa-pipe-pos as shown in the example below.

A dictionary file is required. There are two possible formats for the dictionary file:

  • plain text file: one dictionary entry per line formatted as
dictionary_entry||entry_id||class_id

class_id could be omitted. An entry example is shown here

tiempo de reformas||ID-89||CID-445
  • JSON file: a data array, which elements have (at least) these elements: id and desc (dictionary entry name). An example is shown here
{"data":[
{"id":"ICD-1-P","desc":"P","idclass":"","nivel":"1"},
{"id":"ICD-1-S","desc":"S","idclass":"","nivel":"1"},
{"id":"ICD-2-1","desc":"Autenticación y certificación digital","idclass":"ICD-1-P","nivel":"2"},
{"id":"ICD-2-2","desc":"Tiempo_de_reformas","idclass":"ICD-1-P","nivel":"2"},
{"id":"ICD-2-3","desc":"Control de tráfico de red","idclass":"ICD-1-S","nivel":"2"}]}

There are several parameters:

  • -D (required): specify the dictionary file as parameter.
  • -j (optional): use this parameter if the dictionary file is formatted as JSON instead of plain text file.

You can call to ixa-pipe-domainterms module as follows:

cat text.txt | ixa-pipe-tok | ixa-pipe-pos | perl ixa-pipe-domainterms.pl -D dict.txt

or

cat text.txt | ixa-pipe-tok | ixa-pipe-pos | perl ixa-pipe-domainterms.pl -D dict.json -j

Contact information

Arantxa Otegi
arantza.otegi@ehu.es
IXA NLP Group
University of the Basque Country (UPV/EHU)
E-20018 Donostia-San Sebastián