w2v-sembei [1] is a C++ implementation of a word-segmentation-free version of word2vec [2].
It requires gcc (>= 5).
git clone https://github.com/oshikiri/w2v-sembei.git --recursive
cd w2v-sembei/
mkdir output
make
./w2v-sembei 1000 10000 10000 10000 --corpus sample.txt --window 1 --dim 50
The outputs are:
- list of n-grams (output/vocabulary.csv)
- vector representations of n-grams (output/embeddings_words.csv)
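As a rough illustration of how the two output files might be used together, the Python sketch below loads them and looks up nearest neighbours by cosine similarity. The file layout is an assumption, not something this README specifies: it assumes the n-gram is the first field of each row in output/vocabulary.csv, that each row of output/embeddings_words.csv is a comma-separated list of floats, and that the two files are row-aligned. The helper names (`load_embeddings`, `cosine`, `nearest`) are hypothetical and not part of w2v-sembei. Check the actual files and adjust the parsing if they differ.

```python
import csv
import math


def load_embeddings(vocab_path, vectors_path):
    """Pair each n-gram with its vector (assumes row-aligned files)."""
    with open(vocab_path, encoding="utf-8", newline="") as f:
        # Assumption: the n-gram is the first field of each non-empty row.
        ngrams = [row[0] for row in csv.reader(f) if row]
    with open(vectors_path, encoding="utf-8", newline="") as f:
        # Assumption: each row is a comma-separated list of floats.
        vectors = [
            [float(x) for x in row if x.strip()]
            for row in csv.reader(f)
            if row
        ]
    if len(ngrams) != len(vectors):
        raise ValueError(
            f"row count mismatch: {len(ngrams)} n-grams, {len(vectors)} vectors"
        )
    return dict(zip(ngrams, vectors))


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(embeddings, query, k=5):
    """Return the k n-grams most similar to `query`, best match first."""
    q = embeddings[query]
    scored = [(cosine(q, v), w) for w, v in embeddings.items() if w != query]
    return [w for _, w in sorted(scored, reverse=True)[:k]]
```

Under these assumptions, `load_embeddings("output/vocabulary.csv", "output/embeddings_words.csv")` returns a dictionary from n-gram to vector, and `nearest(embeddings, some_ngram)` lists its closest n-grams. Pure-Python lists keep the sketch dependency-free; for a large vocabulary a NumPy matrix would be a more practical choice.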
- Oshikiri, T. (2017). Segmentation-Free Word Embedding for Unsegmented Languages. In Proceedings of EMNLP 2017. [pdf, bib]
- Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. In Proceedings of ICLR 2013. [code]
- shimo-lab/sembei - GitHub