Golang "native" implementation of word2vec algorithm (word2vec++ port)
The library implements the word2vec algorithm for Golang, leveraging the native runtime without relying on external servers or Python dependencies. This Golang module uses a CGO bridge to integrate Max Fomichev's word2vec C++ library.
Despite the availability of commercial and open-source LLMs, word2vec and its derivatives remain prevalent in building niche applications, particularly when dealing with private datasets. Python, along with the gensim
library, is widely adopted as the quickest means to explore the model for production-grade workloads. Gensim is optimized for performance through the use of C, BLAS, and memory-mapping. However, if your application demands even greater speed, such as performing 94K embedding calculations per second on a single core, native development becomes the optimal solution.
Our objective is to provide a solution that allows the execution of the Word2Vec model natively within a Golang application, eliminating the need to wrap gensim
as a sidecar.
Evaluating existing Golang implementations led us to promising options; however, performance constraints on the UMBC corpus pushed us toward native C integration. This library was born from Max Fomichev's C++ implementation, a prominent cross-platform solution.
Read more in the blog post Blazing Fast Text Embedding With Word2Vec in Golang to Power Extensibility of Large Language Models (LLMs)
- Inspirations
- Getting started
- Usage Command line utility
- Usage Golang module
- How To Contribute
- License
- References
The project offers a solution as both a Golang module and a simple command-line application. Use the command-line tool to train word2vec models and the Golang module to compute embeddings and find similar words within your application.
A dynamically linked library is required for the CGO bridge to integrate with Max Fomichev's word2vec C++ library. Ensure that the necessary C++ libraries are installed and properly configured on your system to use this functionality.
To build the required dynamically linked library, use a C++11 compatible compiler and CMake 3.1 or higher. This step is essential before proceeding with the installation and usage of the Golang module.
brew install cmake
mkdir _build && cd _build
cmake -DCMAKE_BUILD_TYPE=Release ../libw2v
make
cp ../libw2v/lib/libw2v.dylib /usr/local/lib/libw2v.dylib
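The snippet above targets macOS (hence the .dylib and Homebrew). On Linux, CMake typically produces libw2v.so instead; assuming the same build layout, the final copy step would look roughly like this:
# Linux variant (assumption: the build produces libw2v.so in the same directory)
sudo cp ../libw2v/lib/libw2v.so /usr/local/lib/libw2v.so
sudo ldconfig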
Note: The project does not currently distribute library binaries, though this feature is planned for a future version. You will need to build the binaries yourself for your target runtime. If you need assistance, please raise an issue.
You can install the application from source code, but it requires Golang to be installed.
go install github.com/fogfish/word2vec/w2v@latest
The library uses memory-mapped files, enabling extremely fast sequential reading and writing. However, this approach means that the model file format is not compatible with other libraries. Therefore, it is absolutely necessary to train the model using this library if you plan to utilize its functionality.
To start training, begin by configuring the model with the desired parameters:
w2v train config > config.yaml
The default arguments provide satisfactory results for most text corpora:
- word vector dimension 300
- context window 5 words
- 5 training epochs with a 0.05 learning rate
- skip-gram architecture
- negative sampling 5
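As a rough illustration of how these defaults might map onto the generated file, consider the sketch below; the key names are hypothetical, so always use w2v train config to produce the authoritative template:
# illustrative sketch only, hypothetical key names
vector: 300      # word vector dimension
window: 5        # context window, in words
epochs: 5        # number of training epochs
rate: 0.05       # learning rate
skip-gram: true  # skip-gram architecture (false for CBOW)
negative: 5      # negative sampling size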
See the article Word2Vec: Optimal hyperparameters and their impact on natural language processing downstream tasks for a discussion of training options.
The repository contains the book "War and Peace" by Leo Tolstoy. We have also used stop words to increase accuracy.
w2v train -C config.yaml \
-o wap-v300_w5_e5_s1_h005-en.bin \
-f ../doc/leo-tolstoy-war-and-peace-en.txt
We recommend naming the output model based on the parameters used during training. Use the following format for naming:
- v for vector size
- w for context window size
- e for number of training epochs
- s1 for skip-gram architecture or s0 for CBOW
- h1 for hierarchical softmax or h0 for negative sampling, followed by its size digits

For example, a model trained with a vector size of 300, a context window of 5, 10 epochs, using the skip-gram architecture and hierarchical softmax could be named v300_w5_e10_s1_h1.bin.
Calculate embeddings for either a single word or a bag of words. Create a file where each line contains either a single word or a paragraph. The utility will then output a text document where each line contains the corresponding vector for the given text.
echo "
alexander
emperor
king
tsar
the emperor alexander
" > bow.txt
w2v embedding \
-m wap-v300_w5_e5_s1_h005-en.bin \
bow.txt
The word2vec model allows users to find words that are most similar to a given word based on their vector representations. By calculating the similarity between word vectors, the model identifies and retrieves words that are closest in meaning or context to the input word.
w2v lookup \
-m wap-v300_w5_e5_s1_h005-en.bin \
-k 10 \
alexander
The latest version of the module is available on the main branch. All development, including new features and bug fixes, takes place on the main branch using forking and pull requests, as described in the contribution guidelines. The stable version is available via Golang modules.
Use go get to retrieve the library and add it as a dependency to your application.
go get -u github.com/fogfish/word2vec
Calculate embeddings for either a single word or a bag of words.
import "github.com/fogfish/word2vec"
// 1. Load model
w2v, err := word2vec.Load("wap-v300_w5_e5_s1_h005-en.bin", 300)
// 2. Allocate memory for the vector
vec := make([]float32, 300)
// 3. Calculate embeddings for the document
doc := "the emperor alexander"
err = w2v.Embedding(doc, vec)
See the example or try it out via the command line.
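For reference, a complete program built around the snippet above might look like the sketch below; the model path matches the file trained earlier, and the printed slice is purely illustrative.
package main

import (
	"fmt"
	"log"

	"github.com/fogfish/word2vec"
)

func main() {
	// Load the model; the dimension must match the one used at training time.
	w2v, err := word2vec.Load("wap-v300_w5_e5_s1_h005-en.bin", 300)
	if err != nil {
		log.Fatalf("failed to load model: %v", err)
	}

	// Pre-allocate the output vector and reuse it between calls.
	vec := make([]float32, 300)

	// Calculate the embedding for a bag of words.
	if err := w2v.Embedding("the emperor alexander", vec); err != nil {
		log.Fatalf("failed to calculate embedding: %v", err)
	}

	fmt.Println(vec[:8]) // print the first few components
}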
Find words that are most similar to a given word based on their vector representations.
import "github.com/fogfish/word2vec"
// 1. Load model
w2v, err := word2vec.Load("wap-v300_w5_e5_s1_h005-en.bin", 300)
// 2. Allocate the memory for results
seq := make([]word2vec.Nearest, 30)
// 3. Lookup words nearest to the given one
w2v.Lookup("alexander", seq)
See the example
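Likewise, a minimal self-contained lookup program could be sketched as follows; it prints each result with %+v to avoid assuming the exact fields of word2vec.Nearest.
package main

import (
	"fmt"
	"log"

	"github.com/fogfish/word2vec"
)

func main() {
	w2v, err := word2vec.Load("wap-v300_w5_e5_s1_h005-en.bin", 300)
	if err != nil {
		log.Fatalf("failed to load model: %v", err)
	}

	// Buffer for the nearest words; 10 entries are requested here.
	seq := make([]word2vec.Nearest, 10)
	w2v.Lookup("alexander", seq)

	for _, n := range seq {
		fmt.Printf("%+v\n", n)
	}
}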
The library is MIT licensed and accepts contributions via GitHub pull requests:
- Fork it
- Create your feature branch (git checkout -b my-new-feature)
- Commit your changes (git commit -am 'Added some feature')
- Push to the branch (git push origin my-new-feature)
- Create new Pull Request
The build and testing process requires Go version 1.21 or later.
The commit message helps us write good release notes and speeds up the review process. The message should address two questions: what changed and why. The project follows the template defined by the chapter Contributing to a Project of the Git book.
If you experience any issues with the library, please let us know via GitHub issues. We appreciate detailed and accurate reports that help us identify and replicate the issue.