Generate novel words to plug lexical gaps.
- Install conda if necessary and then navigate to this repo in a terminal and run `conda env create -f environment.yml`. Once the installation is complete, activate it with `source activate gap`. If you need to update `environment.yml`, you can update your installations with `conda env update -f environment.yml`.
- For a demo trained on WordNet definitions and GloVe vectors trained on Common Crawl, you can run `python ./demo.py` from the home directory.
- Edit the config file to match the locations of your input and output files.
- Get words with definitions, e.g. WordNet's tagged glosses. Lines must be in the format `word\tdefinition`, which you can get from WordNet's "merged" XML files (`/WordNet-3.0/glosstag/merged`) using the formatting script in this repository.
- Get word vectors, e.g. pretrained GloVe vectors. If you roll your own, each line must contain a single token and its vector values, whitespace-separated, e.g. `give_up 1.2 -2.9 0.0...`. If you use `word2vec` vectors, the first line of the vector file will by default record the size of the vocabulary and the vector length, but the scripts are written to ignore that line.
- To train your own word creator, run `python ./train.py`. This will probably take a long time. You can run on GPU (rather than CPU, the default) by changing `gpu = false` to `gpu = true` in the config file.
- To generate new words and definitions from your trained creator, run `python ./generate.py N`, where `N` is the number of new words to generate. It's recommended to redirect the output to a file, e.g. `python ./generate.py 100 > new_words.txt`.
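If you are preparing your own input files, it can help to check them against the two formats described above before starting a long training run. The following is a minimal sketch, not code from this repository: the function names `read_definitions` and `read_vectors` are hypothetical, the sample definition text is made up, and the header check (a first line made of exactly two integers) is an assumed heuristic for spotting the `word2vec` header.

```python
import io


def read_definitions(handle):
    """Yield (word, definition) pairs from lines in word<TAB>definition format."""
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines
        word, sep, definition = line.partition("\t")
        if not sep or not word or not definition:
            raise ValueError(f"line {line_number}: expected word<TAB>definition")
        yield word, definition


def read_vectors(handle):
    """Return {token: [floats]} from whitespace-separated vector lines.

    Assumption: a first line consisting of exactly two integers is a
    word2vec-style header (vocabulary size, vector length) and is skipped.
    """
    vectors = {}
    for index, line in enumerate(handle):
        fields = line.split()
        if not fields:
            continue
        if index == 0 and len(fields) == 2 and all(f.isdigit() for f in fields):
            continue  # skip the assumed word2vec header
        vectors[fields[0]] = [float(value) for value in fields[1:]]
    return vectors


# Small in-memory samples standing in for real files (illustrative data only).
sample_definitions = io.StringIO("give_up\tstop trying\n")
sample_vectors = io.StringIO("2 3\ngive_up 1.2 -2.9 0.0\ncat 0.5 0.1 -0.3\n")

print(list(read_definitions(sample_definitions)))
print(read_vectors(sample_vectors))
```

To check real files, pass handles opened with `open(path, encoding="utf-8")` instead of the `io.StringIO` samples. The sketch assumes tokens contain no whitespace, as in the `give_up` example.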
Christiane Fellbaum. 1998. WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.
George A. Miller. 1995. WordNet: A Lexical Database for English. Communications of the ACM 38(11): 39–41.
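Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP): 1532–1543.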