- We provide MATLAB code to train skip-gram with negative sampling (SGNS), which was originally introduced by the well-known NLP toolbox word2vec (https://code.google.com/p/word2vec/).
- Our objective function is equivalent to that of SGNS; however, our algorithm is designed differently.
- You can treat our code (w2vsbd.m) as a simple MATLAB implementation of SGNS; a sketch of the objective it optimizes follows this list.
- Moreover, we provide a supervised explicit matrix factorization (w2vsbdsup.m) that boosts performance through supervision.
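For reference, the following is a minimal sketch of the SGNS objective evaluated from a co-occurrence matrix. It is not the actual w2vsbd.m; the function name sgns_objective and the variables D, W, C, and k are illustrative assumptions.

```matlab
% Minimal sketch (not the actual w2vsbd.m): the SGNS objective computed from
% a co-occurrence matrix D, where D(i,j) counts how often word i co-occurs
% with context j, W and C hold the word/context vectors (one per row), and
% k is the number of negative samples.
function loss = sgns_objective(D, W, C, k)
    S   = W * C';                                   % inner products w_i' * c_j
    Neg = k * (sum(D, 2) * sum(D, 1)) / sum(D(:));  % expected negative counts
    loss = -sum(sum(D .* log_sigmoid(S) + Neg .* log_sigmoid(-S)));
end

function y = log_sigmoid(x)
    y = -log(1 + exp(-x));                          % log(sigmoid(x)), elementwise
end
```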
- Download the dataset 'enwik9.zip' from http://cs.fit.edu/~mmahoney/compression/textdata.html
- Decompress 'enwik9.zip' to obtain 'enwik9', then put it in the folder './data/'
- Run run_emf.m in MATLAB to reproduce the first experiment in our paper
- Run run_semf.m in MATLAB to reproduce the second experiment in our paper (these steps are sketched below)
- Refer to our paper (https://etali.github.io/papers/EMF-IJCAI2015.pdf) and the code for details
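Assuming 'enwik9.zip' has been downloaded into the repository root, the steps above correspond roughly to the following MATLAB session (a sketch; the paths follow the instructions above):

```matlab
% Sketch of the steps above, run from the repository root.
unzip('enwik9.zip', './data/');  % produces ./data/enwik9
run_emf;                         % first experiment in the paper
run_semf;                        % second experiment in the paper
```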
- Red Hat Enterprise Linux Server release 6.2 (64-bit)
- Perl 5.10
- gcc 4.4.5
- MATLAB R2011a
We adopt word2vec from https://code.google.com/p/word2vec/ to generate the co-occurrence matrix, and our algorithm operates solely on that matrix. Our algorithm is a batch-mode alternating minimization; it is not as scalable as the algorithm in word2vec, but it performs as well as the skip-gram negative sampling (SGNS) implementation that word2vec provides. We include the word2vec.c code used in our project under the folder emf/word2vec/, in which we altered several snippets.
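For intuition, here is a hedged sketch of batch-mode alternating minimization on a co-occurrence matrix. It is not our released code, and all names (D, dim, k, eta, n_iter) are illustrative. Each pass takes a gradient step on the word vectors W with the context vectors C fixed, and then on C with W fixed:

```matlab
% Hedged sketch of batch-mode alternating minimization (not our released
% code). D is the co-occurrence matrix; k is the negative-sampling rate.
[nw, nc] = size(D);
dim = 100; k = 5; eta = 1e-4; n_iter = 50;       % illustrative settings
W = 0.01 * randn(nw, dim);                       % word vectors
C = 0.01 * randn(nc, dim);                       % context vectors
Neg = k * (sum(D, 2) * sum(D, 1)) / sum(D(:));   % expected negative counts
for it = 1:n_iter
    S = W * C';
    G = D ./ (1 + exp(S)) - Neg ./ (1 + exp(-S));  % d(log-likelihood)/dS
    W = W + eta * (G * C);                         % ascent step on W, C fixed
    S = W * C';
    G = D ./ (1 + exp(S)) - Neg ./ (1 + exp(-S));
    C = C + eta * (G' * W);                        % ascent step on C, W fixed
end
```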
Yitan Li, Linli Xu, Fei Tian, Liang Jiang, Xiaowei Zhong, Enhong Chen
University of Science and Technology of China