Create training data for Word Sense Disambiguation (WSD) deep learning
A Java based project using my own Semantic Lexicon for creating word sense disambiguation training data for deep learning. WordNet does not provide a clear enough set of semantic nouns, so I've created my own clear and practical training set.
This uses the Apache OpenNLP parser for sentence boundary detection, and Part of Speech (POS) tagging using Penn tags.
Creates a Java set in ./dist/
by running
gradle clean build setup
Added support for reading .txt
, .gz
and Peter's .parsed
file formats for setting up
Unlabelled training data
The processed data is then used to train an LSTM using Keras/Tensorflow that can be loaded to get a neural network that will label the correct Synset ID (according to the lexicon) and assing a Synset ID to an ambiguous noun.
With the right lexicon / training data this seems to get around a 75% accuracy level.