Wikipedia Word2Vec

This repository uses gensim to train word2vec on a Wikipedia dump, with a focus on Chinese Wikipedia data.

1 Download the wiki dump from Wikimedia Downloads.
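
For the Chinese Wikipedia this is the zhwiki pages-articles dump, for example (the exact file name changes with each dump date):

wget https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2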

2 Use wikiextractor to extract and clean the wiki text.

git clone https://github.com/attardi/wikiextractor.git
cd wikiextractor
python WikiExtractor.py -o path/to/extract_dir path/to/wiki_dump
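
wikiextractor writes its output as files named wiki_00, wiki_01, ... inside subdirectories AA, AB, ..., with each article wrapped in <doc ...> ... </doc> tags. A minimal sketch for reading that layout back, assuming the default output options (the function name iter_article_lines is our own, not part of this repository):

import os
import re

DOC_TAG = re.compile(r"^</?doc\b.*>$")

def iter_article_lines(extract_dir):
    """Yield non-empty text lines from every extracted file, skipping the
    <doc ...> / </doc> wrapper tags that wikiextractor adds around articles."""
    for root, _, names in os.walk(extract_dir):
        for name in sorted(names):
            with open(os.path.join(root, name), encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if line and not DOC_TAG.match(line):
                        yield line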

3 Chinese users might want to convert traditional Chinese to simplified Chinese. Either OpenCC or the pure-Python OpenCC-Python can be used. Run the following script to perform the conversion.

python opencc_zhwiki_t2s.py path/to/extract_dir path/to/convert_dir 
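
A minimal sketch of what such a conversion step looks like, assuming the OpenCC-Python package; the helper name convert_file is hypothetical, and the original OpenCC binding may expect the config name 't2s.json' instead of 't2s':

from opencc import OpenCC

cc = OpenCC('t2s')  # traditional Chinese -> simplified Chinese

def convert_file(src_path, dst_path):
    # Convert one extracted file line by line to keep memory use low.
    with open(src_path, encoding='utf-8') as src, \
         open(dst_path, 'w', encoding='utf-8') as dst:
        for line in src:
            dst.write(cc.convert(line))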

4 Finally, we can train the vectors. The tokenizer is borrowed from BERT.

python wiki_word2vec.py path/to/extract_dir-or-convert_dir path/to/word2vec
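
For reference, the training itself can be sketched with gensim roughly as follows, assuming the corpus has already been tokenized into whitespace-separated tokens, one sentence per line; the parameter values are illustrative, not this repository's defaults:

from gensim.models import Word2Vec
from gensim.models.word2vec import PathLineSentences

# PathLineSentences streams every file in the directory, one sentence per line.
sentences = PathLineSentences('path/to/convert_dir')

model = Word2Vec(
    sentences,
    vector_size=300,   # embedding dimension ("size" in gensim < 4.0)
    window=5,
    min_count=5,
    workers=4,
    sg=1,              # skip-gram; set to 0 for CBOW
)

model.wv.save_word2vec_format('path/to/word2vec', binary=False)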

TODO

  • Shall we create an embedding for <UNK>?
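
One common approach, sketched below with gensim >= 4.0, is to represent <UNK> as the mean of all learned vectors; both the token name and the averaging strategy are assumptions, not something the current code does:

from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format('path/to/word2vec', binary=False)
unk = wv.vectors.mean(axis=0)      # average of every in-vocabulary vector
wv.add_vectors(['<UNK>'], [unk])   # register the new token under the assumed name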