wordsim

Preparations

Building the components requires the build-essential and python-dev packages, which can be installed with sudo apt-get install build-essential python-dev. setuptools must also be installed for Python.

Dependencies

4lang

Install the newest version of 4lang. Notes:

  • downloadable pre-compiled graphs are sufficient
  • you don't have to modify the config files
  • set only the FOURLANGPATH and HUNTOOLSBINPATH environment variables
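The last note can be checked before running anything. The following is a minimal sketch that only assumes the two variable names listed above; the helper name missing_env_vars is hypothetical and is not part of 4lang or wordsim.

```python
import os

# Variable names taken from the notes above.
REQUIRED_VARS = ("FOURLANGPATH", "HUNTOOLSBINPATH")


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; not part of 4lang or wordsim.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling missing_env_vars() with no arguments inspects the current process environment; an empty list means both variables are set.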

Additional libraries

Install the newest version of:

Resources

After preparing the resources you should get the following directory structure:

wordsim  
└───resources
    ├───embeddings
    │   ├───senna
    │   │   └───combined.txt
    │   ├───huang
    │   │   └───combined.txt
    │   ├───word2vec
    │   │   └───GoogleNews-vectors-negative300.bin
    │   ├───glove
    │   │   └───glove.840B.300d.w2v
    │   ├───sympat
    │   │   └───sp_plus_embeddings_500.w2v
    │   └───paragram_300
    │       └───paragram_300_sl999.txt
    └───sim_data
        └───simlex
            └───SimLex-999.txt
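
Since loading fails if any of these files is absent, it may help to verify the layout first. The sketch below lists the paths exactly as they appear in the tree above; the function name missing_resources is hypothetical, and the default root assumes the script is started from the wordsim directory.

```python
import os

# Relative paths copied from the directory tree above.
EXPECTED_FILES = [
    "resources/embeddings/senna/combined.txt",
    "resources/embeddings/huang/combined.txt",
    "resources/embeddings/word2vec/GoogleNews-vectors-negative300.bin",
    "resources/embeddings/glove/glove.840B.300d.w2v",
    "resources/embeddings/sympat/sp_plus_embeddings_500.w2v",
    "resources/embeddings/paragram_300/paragram_300_sl999.txt",
    "resources/sim_data/simlex/SimLex-999.txt",
]


def missing_resources(root="."):
    """Return the expected files that do not exist under root.

    Hypothetical helper for illustration; not part of wordsim.
    """
    return [
        rel for rel in EXPECTED_FILES
        if not os.path.isfile(os.path.join(root, *rel.split("/")))
    ]
```

An empty list from missing_resources() means every file in the tree is in place.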

Embeddings

SimLex data

Usage

Run python src/wordsim/regression.py configs/default.cfg to train a regression on 12 features: 6 from the embeddings (one per embedding), 4 from WordNet metrics, and 2 from 4lang. The expected result is average correlation: 0.755074732764.
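
To compare a run against the expected figure automatically, the output can be scanned for the result line. This is a rough sketch that assumes the script prints the score to stdout or stderr in the form quoted above; both function names are hypothetical, and the helper is a separate script rather than part of wordsim.

```python
import re
import subprocess

EXPECTED = 0.755074732764  # value quoted in the Usage section


def parse_average_correlation(output):
    """Extract the number after 'average correlation:' or return None."""
    match = re.search(r"average correlation:\s*([0-9]*\.?[0-9]+)", output)
    return float(match.group(1)) if match else None


def run_default_regression():
    """Run the command from the Usage section and return its parsed score."""
    # Assumption: the score appears on stdout or stderr as shown above,
    # so both streams are merged before parsing.
    proc = subprocess.run(
        ["python", "src/wordsim/regression.py", "configs/default.cfg"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return parse_average_correlation(proc.stdout)
```

A caller could then check abs(run_default_regression() - EXPECTED) < 1e-6 after confirming that the returned value is not None.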

NOTE: wordsim requires approximately 15 GB of RAM to load all models.
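
A quick way to see whether a machine is in that range is to read the total physical memory. The sketch below relies on os.sysconf, which is available on Linux and most Unix systems; the function names are hypothetical, the threshold is the approximate figure from the note, and total memory is only an upper bound on what is actually free.

```python
import os

REQUIRED_GIB = 15  # approximate figure from the note above


def total_ram_gib():
    """Return physical memory in GiB, or None if it cannot be determined."""
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None  # platform without os.sysconf or without these names
    if page_size <= 0 or page_count <= 0:
        return None
    return page_size * page_count / float(1024 ** 3)


def has_enough_ram(total_gib, required_gib=REQUIRED_GIB):
    """True if total_gib is known and at least required_gib."""
    return total_gib is not None and total_gib >= required_gib
```

For example, has_enough_ram(total_ram_gib()) returns False both when memory is too small and when it could not be determined.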

Citing

If you use the wordsim system in your experiments, please cite:

Gábor Recski, Eszter Iklódi, Katalin Pajkossy, András Kornai: Measuring semantic similarity of words using concept networks. In: Proceedings of the 1st Workshop on Representation Learning for NLP, 2016.

@InProceedings{Recski:2016c,
  author    = {Recski, G\'{a}bor  and  Ikl\'{o}di, Eszter  and  Pajkossy, Katalin  and  Kornai, Andr\'{a}s},
  title     = {Measuring Semantic Similarity of Words Using Concept Networks},
  booktitle = {Proceedings of the 1st Workshop on Representation Learning for NLP},
  year      = {2016},
  address   = {Berlin, Germany},
  publisher = {Association for Computational Linguistics},
  pages     = {193--200}
}