it_makes_sense_WSD: A Python repository from PaulHuygen

#KAF/NAF Python wrapper for the It-Makes-Sense WSD system#

This repository implements a wrapper in Python around the It Makes Sense (IMS) system for Word Sense Disambiguation of English text to allow KAF or NAF files as input or output. The description of the KAF or NAF format can be found at https://github.com/opener-project/kaf and http://www.newsreader-project.eu/files/2013/01/techreport.pdf

##Requirements##

These are the only two dependencies:

* The KafNafParserPy library, which can be found at https://github.com/cltl/KafNafParserPy
* The IMS system, which can be found at http://www.comp.nus.edu.sg/~nlp/software.html

If you want to use ILI synsets of WN30 instead of sensekeys of WN171, you also need the mappings between WordNet versions provided by the UPC (http://www.talp.upc.edu/index.php/technology/tools/45-textual-processing-tools/98-wordnet-mappings/).

##Installation##

The installation is simple; just follow these steps:

1. Go to the folder where you want to install this repository
2. Clone this repository
3. Clone the KafNafParserPy repository
4. Run install_ims.sh in the repository to download and install all the files required by the original IMS system

You can skip step 4 if you already have the IMS system installed on your machine (make sure that the models folder is a direct subfolder of the root folder of your IMS installation). In that case, edit the file path_to_ims.py and set the variable PATH_TO_IMS to the path of IMS on your local machine.

For example, to install this repository in ~/my_github:

    cd ~/my_github
    git clone https://github.com/rubenIzquierdo/it_makes_sense_WSD.git
    cd it_makes_sense_WSD
    install_ims.sh
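
If IMS is already installed and step 4 is skipped, the check mentioned above (the models folder must sit directly under the IMS root) could be sketched as follows. This is only an illustration: the helper name `has_models_subfolder`, the literal folder name `models`, and the example path are assumptions, not part of the repository.

```python
import os

# Hypothetical value; in practice this is whatever PATH_TO_IMS
# is set to in path_to_ims.py on your machine.
PATH_TO_IMS = os.path.expanduser("~/my_github/it_makes_sense_WSD/ims")


def has_models_subfolder(ims_root, models_name="models"):
    """Return True if ims_root contains a subfolder called models_name.

    The folder name "models" is an assumption based on the wording of
    the installation notes.
    """
    return os.path.isdir(os.path.join(ims_root, models_name))


if __name__ == "__main__":
    if has_models_subfolder(PATH_TO_IMS):
        print("IMS root looks usable:", PATH_TO_IMS)
    else:
        print("No models subfolder found under", PATH_TO_IMS)
```

Running such a check before calling the wrapper gives a clearer message than a failure deep inside IMS.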

##Usage##

The main script is call_ims.py. It reads a KAF or NAF file from the standard input and writes the same file, extended with the sense information provided by the IMS WSD system, to the standard output. The KAF/NAF input file must have at least the token layer (created by a tokeniser) and the term layer (usually created by a POS tagger). The repository includes one example file, input.kaf. To run IMS on this file and write the result to the file output.kaf, run:

    cat input.kaf | python call_ims.py > output.kaf
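
The same pipe can be driven from Python when the wrapper is one step in a larger pipeline. The sketch below is a minimal illustration using only the standard library; the function names `build_command` and `run_ims` are hypothetical, and it assumes that `python` on your PATH is the interpreter call_ims.py expects and that the working directory is the repository folder.

```python
import subprocess


def build_command(script="call_ims.py", options=(), python="python"):
    """Build the argument list for one call to the wrapper script.

    options is a sequence of command-line flags, for example ["-ili30"].
    """
    return [python, script, *options]


def run_ims(input_path, output_path, options=()):
    """Feed input_path to call_ims.py on stdin and save stdout to output_path."""
    with open(input_path, "rb") as fin, open(output_path, "wb") as fout:
        subprocess.run(
            build_command(options=options),
            stdin=fin,
            stdout=fout,
            check=True,  # raise CalledProcessError if the script fails
        )
```

Calling `run_ims("input.kaf", "output.kaf")` from the repository folder would then mirror the shell command above.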

To see the help of the script, run:

    call_ims.py -h
    usage: cat myfile.naf | ./call_ims.py [-h] [-pos|-morphofeat]

    Wrapper for the ItMakesSense WSD system that allows KAF/NAF as input and
    output formats

    optional arguments:
      -h, --help   show this help message and exit
      -ili30       Map skeys of WN171 provided by the system to ili synsets of
                   WN30
      -pos         Use the POS tags of the pos attribute in the input KAf/NAF file
      -morphofeat  Use the POS tags of the morphofeat attribute in the input
                   KAf/NAF file

You can force IMS to use the POS tags found in the input KAF/NAF file by specifying the parameter -pos or -morphofeat, which refer to the pos and morphofeat attributes of the term layer, respectively. If you provide neither parameter, IMS performs POS tagging and lemmatisation internally (the POS tags and lemmas in the input term layer are not modified). The parameter -ili30 generates WordNet 3.0 synsets instead of WordNet 1.7.1 sensekeys. For instance:

    <term lemma="bit" morphofeat="NN" pos="N" tid="t19" type="open">
      <span>
        <target id="w19"/>
      </span>
      <externalReferences>
        <externalRef confidence="0.3391744933419935" reference="bit%1:23:01::" resource="ItMakesSense#WN-1.7.1"/>
        <externalRef confidence="0.16435686275758304" reference="bit%1:28:00::" resource="ItMakesSense#WN-1.7.1"/>
        ...

With the -ili30 option, the same term is annotated with WordNet 3.0 synsets instead:

    <term lemma="bit" morphofeat="NN" pos="N" tid="t19" type="open">
      <span>
        <target id="w19"/>
      </span>
      <externalReferences>
        <externalRef confidence="0.3391744933419935" reference="ili-30-13761407-n" resource="WordNet-3.0"/>
        <externalRef confidence="0.16435686275758304" reference="ili-30-15246853-n" resource="WordNet-3.0"/>
	...
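
To consume this output downstream, the highest-confidence sense of each term can be picked from the externalRef elements. The sketch below uses only the standard library `xml.etree.ElementTree` rather than KafNafParserPy; the function name `best_senses` is hypothetical, and it assumes the element and attribute names shown in the sample above (with `id` as a fallback for files that do not use `tid`).

```python
import xml.etree.ElementTree as ET


def best_senses(kaf_xml):
    """Map each term identifier to its (reference, confidence) pair
    with the highest confidence.

    kaf_xml is the text of a KAF/NAF document. Terms without any
    externalRef carrying a confidence value are skipped.
    """
    root = ET.fromstring(kaf_xml)
    best = {}
    for term in root.iter("term"):
        refs = [
            ref
            for ref in term.findall("./externalReferences/externalRef")
            if ref.get("confidence") is not None
        ]
        if not refs:
            continue
        top = max(refs, key=lambda ref: float(ref.get("confidence")))
        # Assumption: the identifier is stored in "tid", else in "id".
        term_id = term.get("tid") or term.get("id")
        best[term_id] = (top.get("reference"), float(top.get("confidence")))
    return best
```

For a file on disk, the text can be read first and passed in, for example `best_senses(open("output.kaf", encoding="utf-8").read())`.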

##Creating IMS training data from KAF/NAF##

You can create training data for your own IMS models with the script convert_to_ims_training_format.py. Calling this script with the option -h prints the description of its parameters:

    convert_to_ims_training_format.py -h
    usage: convert_to_ims_training_format.py [-h] [-v] (-f FILE | -l FILE_PATHS)
                                             -o OUTPUT

    Creates IMS training file from KAF/NAF files

    optional arguments:
      -h, --help     show this help message and exit
      -v, --version  show program's version number and exit
      -f FILE        Single KAF/NAF file
      -l FILE_PATHS  File with a list of paths to KAF/NAF files
      -o OUTPUT      Output folder

The input can be either a single KAF/NAF file (option -f) or a file containing a list of paths to KAF/NAF files (one path per line, option -l). The output parameter -o specifies the folder where all the training files are stored. The script generates two files for every lemma: one with the instances and contexts for training, and one with the key senses. In addition to this folder, a file is generated with the same name as the folder plus the suffix .word_list. This file contains all the lemmas (without suffixes) for which training files were generated, so it can be used later to train the models with the train_one.bash script provided with the IMS package.
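
The list file for the -l option can be built by hand, or with a small helper like the one sketched below. This is an illustration only: the function name `write_path_list` is hypothetical, and it assumes that the list file holds one path per line and that the input files end in .kaf or .naf.

```python
import os


def write_path_list(corpus_dir, list_file, extensions=(".kaf", ".naf")):
    """Collect KAF/NAF files under corpus_dir and write their absolute
    paths to list_file, one per line. Returns the sorted list of paths.
    """
    wanted = tuple(ext.lower() for ext in extensions)
    paths = []
    for dirpath, _dirnames, filenames in os.walk(corpus_dir):
        for name in filenames:
            if name.lower().endswith(wanted):
                paths.append(os.path.abspath(os.path.join(dirpath, name)))
    paths.sort()
    with open(list_file, "w", encoding="utf-8") as out:
        for path in paths:
            out.write(path + "\n")
    return paths
```

The resulting file would then be passed as the FILE_PATHS argument of -l, together with an output folder for -o.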

##Trained models##

You can find models trained on SemCor and the WordNet glosses (both with WordNet version 3.0) here: http://kyoto.let.vu.nl/~izquierdo/ims_models/. To use them, download the models and modify the file call_ims.py to point to them (you will also need the dict files of WordNet 3.0, as ItMakesSense uses these files). Update these variables at the top of the script:

    ims_models = '/home/izquierdo/ruben_github/it_makes_sense_WSD/ims/semcor30_wngloss_models'
    wordnet_dict_folder = '/home/izquierdo/wordnets/wordnet-3.0/dict/index.sense'

##Contact##

Ruben Izquierdo
Vrije University of Amsterdam
ruben.izquierdobevia@vu.nl rubensanvi@gmail.com
http://rubenizquierdobevia.com/
