Ruby GloVe

Ruby implementation of Global Vectors for Word Representations.

Overview

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.
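
To make "word-word co-occurrence statistics" concrete, here is a minimal sketch in plain Ruby. It is not taken from this gem's source: the method name `cooccurrences` and the `window` argument are made up for illustration, and the 1/distance weighting follows the counting scheme described in the GloVe paper, which this library may or may not use.

```ruby
# Count symmetric word-word co-occurrences within a fixed window.
# Illustrative only: pairs that are d tokens apart contribute 1/d.
def cooccurrences(tokens, window = 2)
  counts = Hash.new(0.0)
  tokens.each_with_index do |word, i|
    # Look only to the right and store both directions,
    # which keeps the resulting counts symmetric.
    last = [i + window, tokens.size - 1].min
    ((i + 1)..last).each do |j|
      weight = 1.0 / (j - i)
      counts[[word, tokens[j]]] += weight
      counts[[tokens[j], word]] += weight
    end
  end
  counts
end

tokens = %w[quantum physics describes quantum states]
counts = cooccurrences(tokens, 2)
puts counts[%w[quantum physics]]
puts counts[%w[physics quantum]]
```

Training then fits word vectors so that their dot products approximate the logarithm of these counts.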

NOTE: This is an early prototype.

Resources

Dependencies

This library relies on the rb-gsl gem for matrix and vector operations, so the GNU Scientific Library (GSL) must be installed.

Linux:

$ sudo apt-get install libgsl0-dev

OS X:

$ brew install gsl

Only compatible with MRI; tested on versions 2.0.x and 2.1.x.

Installation

$ gem install glove

or add it to your Gemfile:

gem 'glove'

Usage

require 'glove'

# See documentation for all available options
model = Glove::Model.new

# Next feed it some text.
text = File.read('quantum-physics.txt')
model.fit(text)

# Or you can pass it a Glove::Corpus object as the text argument instead
corpus = Glove::Corpus.build(text)
model.fit(corpus)

# Finally, to query the model, we need to train it
model.train

# So far, word similarity and analogy task methods have been included:
# Most similar words to quantum
model.most_similar('quantum')
# => [["physic", 0.9974459436353388], ["mechan", 0.9971606266531394], ["theori", 0.9965966776283189]]

# What words relate to atom like quantum relates to physics?
model.analogy_words('quantum', 'physics', 'atom')
# => [["electron", 0.9858380292886947], ["energi", 0.9815122410243475], ["photon", 0.9665073849076669]]

# Save the trained matrices and vectors for later usage in binary formats
model.save('corpus.bin', 'cooc-matrix.bin', 'word-vec.bin', 'word-biases.bin')

# Later on create a new instance and call #load
model = Glove::Model.new
model.load('corpus.bin', 'cooc-matrix.bin', 'word-vec.bin', 'word-biases.bin')
# Now you can query the model again and get the same results as above
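
The scores returned by the queries above look like cosine similarities. As a rough illustration of how such queries are commonly computed, here is a sketch in plain Ruby arrays. It assumes cosine similarity and the usual "b - a + c" vector arithmetic for analogies; the helper names (`cosine`, `nearest`, `analogy`) and the toy vectors are hypothetical and are not part of the Glove::Model API.

```ruby
# Cosine similarity between two equal-length numeric arrays.
def cosine(a, b)
  dot  = a.zip(b).inject(0.0) { |sum, (x, y)| sum + x * y }
  norm = ->(v) { Math.sqrt(v.inject(0.0) { |sum, x| sum + x * x }) }
  dot / (norm.call(a) * norm.call(b))
end

# Rank all words (minus the excluded ones) by cosine similarity to target.
def nearest(vectors, target, exclude = [], top = 3)
  vectors.reject { |word, _| exclude.include?(word) }
         .map { |word, vec| [word, cosine(vec, target)] }
         .sort_by { |_, score| -score }
         .first(top)
end

# "a is to b as c is to ?": search near the point b - a + c.
def analogy(vectors, a, b, c, top = 3)
  target = vectors[b].zip(vectors[a], vectors[c]).map { |vb, va, vc| vb - va + vc }
  nearest(vectors, target, [a, b, c], top)
end

# Toy 2-dimensional vectors, made up for illustration only.
vectors = {
  'quantum'  => [1.0, 0.0],
  'physics'  => [1.0, 1.0],
  'atom'     => [0.0, 1.0],
  'electron' => [0.0, 2.0],
  'energy'   => [1.0, 2.0],
  'wave'     => [2.0, 0.0]
}

similar = nearest(vectors, vectors['quantum'], ['quantum'])
related = analogy(vectors, 'quantum', 'physics', 'atom')
puts similar.inspect
puts related.inspect
```

Excluding the query words from the candidates keeps a word from being returned as its own nearest neighbour.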

Performance

Thanks to the rb-gsl bindings for GSL, matrix and vector operations are fast. The GloVe algorithm itself, however, requires quite a bit of computational power, even in the original C implementation. If you need speed, use smaller texts with a vocabulary size of no more than 100K words. Processing a text of about 160K words (a compilation of several books on quantum mechanics) on a late-2012 MBP (8GB RAM) with ruby-2.1.5 takes about 7 minutes:

 $ ruby -Ilib benchmark/benchmark.rb
                 user     system      total        real
Fit Text    11.320000   0.070000  11.390000 ( 11.387612)
Vocabulary size: 158323
Unique tokens: 2903
Co-occur     1.330000   0.250000 1107.720000 (300.738453)
Train      121.120000  12.960000  134.080000 (128.409034)
Similarity   0.010000   0.000000    0.010000 (  0.057423)
Give me the 3 most similar words to quantum
[["problem", 0.9977609386134489], ["mechan", 0.9977529272587808], ["classic", 0.9974759411408415]]
Analogy      0.010000   0.000000   0.010000 (  0.010674)
What 3 words relate to atom like quantum relates to mechanics?
[["particl", 0.9982711579369483], ["find", 0.9982303885530384], ["expect", 0.9982017117355527]]
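
The user/system/total/real columns above are the report format of Ruby's standard Benchmark module. To time your own corpus in the same way, something along these lines should work. This is only a sketch and not the contents of benchmark/benchmark.rb: the `busy_work` lambda is a hypothetical stand-in that you would replace with calls such as `model.fit(text)` and `model.train`.

```ruby
require 'benchmark'

# Stand-in workload (hypothetical); swap the block bodies below
# for model.fit(text), model.train and so on.
busy_work = ->(n) { (1..n).inject(0) { |sum, i| sum + i * i } }

# Benchmark.bm prints one row per report and returns the timings.
results = Benchmark.bm(12) do |x|
  x.report('Fit Text') { busy_work.call(100_000) }
  x.report('Train')    { busy_work.call(300_000) }
end

# Sum the wall-clock ("real") column across all reports.
total_real = results.inject(0.0) { |sum, tms| sum + tms.real }
puts format('Total wall-clock time: %.3fs', total_real)
```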

TODO

Word Vector graphs

Contributing

1. Fork it ( https://github.com/vesselinv/glove/fork )
2. Create your feature branch (git checkout -b my-new-feature)
3. Commit your changes (git commit -am 'Add some feature')
4. Push to the branch (git push origin my-new-feature)
5. Create a new Pull Request
