Provides an Apache Thrift interface for word-vector lookup and nearest-neighbor search.
Uses LevelDB for word/vector lookup and Annoy for nearest-neighbor vector lookup.
Install Yarn: https://yarnpkg.com/
Then install the dependencies by running the following in this directory:
yarn
Visual Studio Code is the recommended IDE: https://code.visualstudio.com
The project is implemented in TypeScript: http://www.typescriptlang.org
The input file format should be:
- text file, one line per word
- N tokens per line, delimited by whitespace
- first token is the word
- subsequent tokens are the values in the vector for the word
- every line must have the same number of tokens
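The format rules above can be sketched as a small parser. This is only an illustration under stated assumptions, not the project's actual indexing code: the names `WordVector`, `parseLine`, and `parseLines` are hypothetical, and skipping blank lines is an assumption rather than documented behavior.

```typescript
// Hypothetical helpers for illustration; not taken from this project's source.
interface WordVector {
  word: string;
  vector: number[];
}

// Parse one line: the first token is the word, the rest are vector values.
function parseLine(line: string, expectedDim?: number): WordVector {
  const tokens = line.trim().split(/\s+/);
  if (tokens.length < 2 || tokens[0] === "") {
    throw new Error("expected a word followed by at least one value");
  }
  const [word, ...rest] = tokens;
  const vector = rest.map(Number);
  if (vector.some((v) => Number.isNaN(v))) {
    throw new Error(`non-numeric value in vector for "${word}"`);
  }
  if (expectedDim !== undefined && vector.length !== expectedDim) {
    throw new Error(
      `"${word}" has ${vector.length} values, expected ${expectedDim}`
    );
  }
  return { word, vector };
}

// Parse a whole text, enforcing that every line has the same token count.
// Assumption: blank lines are skipped rather than rejected.
function parseLines(text: string): WordVector[] {
  const entries: WordVector[] = [];
  let dim: number | undefined;
  for (const line of text.split(/\r?\n/)) {
    if (line.trim() === "") continue;
    const entry = parseLine(line, dim);
    dim = entry.vector.length; // the first line fixes the dimension
    entries.push(entry);
  }
  return entries;
}
```

For a real GloVe file, reading line by line from a stream would be preferable to loading the whole text into memory; the string-based version is kept here only for brevity.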
After indexing, the file is converted to:
- a LevelDB database providing two-way word <-> vector lookup
- an Annoy index for nearest-neighbor (cosine similarity) vector lookup
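To make the cosine-similarity ranking concrete, here is a minimal brute-force sketch. It does not use Annoy and is not this project's code: Annoy builds an index for approximate search, whereas this compares the query against every vector exactly. The names `cosineSimilarity` and `nearest` are hypothetical.

```typescript
// Hypothetical illustration; exact brute-force search, not Annoy.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error("cosine similarity is undefined for a zero vector");
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k words whose vectors are most similar to the query vector.
function nearest(
  query: number[],
  entries: { word: string; vector: number[] }[],
  k: number
): { word: string; similarity: number }[] {
  return entries
    .map((e) => ({ word: e.word, similarity: cosineSimilarity(query, e.vector) }))
    .sort((x, y) => y.similarity - x.similarity) // highest similarity first
    .slice(0, k);
}
```

Brute force costs one comparison per indexed word for every query, which is the cost an approximate index such as Annoy is meant to avoid on large vocabularies.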
Index example:
node src/index.js --annoy -t 50 ../extern/glove.6B/glove.6B.300d.txt
Then launch the query server on the indexed data:
Example:
node src/server.js -p 9999 -e -a --data ../extern/glove.6B
# the docker equivalent
docker run --rm -v /host/path/to/data:/data -p 9999:9999 mhzed/wordvec_query server -- -p 9999 -a --data /data
To query the server via command line:
node src/client.js --server http://localhost:9999/thrift "wikipedia"
# the docker version
docker run --rm mhzed/wordvec_query client -- --server http://docker.for.mac.localhost:9999/thrift "wikipedia"
To query the server from another language, use the protocol defined in ./thrift/vecquery.thrift. Compile it with the Apache Thrift compiler into the language of your choice.
The full build command is:
# fish
npm run peg;npm run compile-thrift;npm run compile;npm run test
# bash
npm run peg && npm run compile-thrift && npm run compile && npm run test