This repository uses gensim to train word2vec from a wiki dump, with a focus on Chinese Wikipedia data.
1 Download the wiki dump from Wikimedia Downloads.
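For example, the latest Chinese dump can usually be fetched directly (the exact file name on Wikimedia Downloads may change over time):
wget https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2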
2 Use wikiextractor to extract and clean the wiki text.
git clone https://github.com/attardi/wikiextractor.git
cd wikiextractor
python WikiExtractor.py path/to/wiki_dump -o path/to/extract_dir
3 For Chinese users, we may want to convert traditional Chinese to simplified Chinese. Either OpenCC or the pure-Python OpenCC-Python can be used. Use the following script to perform the conversion.
python opencc_zhwiki_t2s.py path/to/extract_dir path/to/convert_dir
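If you prefer to roll your own conversion, the following is a minimal sketch of a traditional-to-simplified pass over the extracted files. It assumes the opencc-python-reimplemented package and is not the repository's opencc_zhwiki_t2s.py itself.

import os
import sys

from opencc import OpenCC  # pip install opencc-python-reimplemented

def convert_dir(src_dir, dst_dir):
    """Convert every extracted wiki file from traditional to simplified Chinese."""
    cc = OpenCC('t2s')  # t2s = traditional Chinese -> simplified Chinese
    for root, _, files in os.walk(src_dir):
        for name in files:
            src_path = os.path.join(root, name)
            rel = os.path.relpath(src_path, src_dir)
            dst_path = os.path.join(dst_dir, rel)
            os.makedirs(os.path.dirname(dst_path), exist_ok=True)
            with open(src_path, encoding='utf-8') as fin, \
                 open(dst_path, 'w', encoding='utf-8') as fout:
                for line in fin:
                    fout.write(cc.convert(line))

if __name__ == '__main__':
    convert_dir(sys.argv[1], sys.argv[2])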
4 Finally, we can train the word vectors; the tokenizer is borrowed from BERT.
python wiki_word2vec.py path/to/extract_dir-or-convert_dir path/to/word2vec
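For reference, below is a minimal sketch of the training step with gensim (>= 4.0). It uses the Hugging Face transformers BertTokenizer ('bert-base-chinese') as a stand-in for "the tokenizer from BERT" and hypothetical hyperparameters; the actual wiki_word2vec.py may differ in details.

import os
import sys

from gensim.models import Word2Vec
from transformers import BertTokenizer

class WikiSentences:
    """Iterate over extracted wiki files and yield one token list per line."""

    def __init__(self, corpus_dir, tokenizer):
        self.corpus_dir = corpus_dir
        self.tokenizer = tokenizer

    def __iter__(self):
        for root, _, files in os.walk(self.corpus_dir):
            for name in files:
                with open(os.path.join(root, name), encoding='utf-8') as f:
                    for line in f:
                        line = line.strip()
                        # skip wikiextractor's <doc ...> / </doc> markers and blank lines
                        if not line or line.startswith('<doc') or line.startswith('</doc'):
                            continue
                        yield self.tokenizer.tokenize(line)

if __name__ == '__main__':
    corpus_dir, out_path = sys.argv[1], sys.argv[2]
    tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
    model = Word2Vec(
        sentences=WikiSentences(corpus_dir, tokenizer),
        vector_size=300,   # embedding dimensionality (hypothetical choice)
        window=5,
        min_count=5,
        workers=4,
        sg=1,              # skip-gram
    )
    model.wv.save_word2vec_format(out_path, binary=False)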
- Shall we create an embedding for <UNK>?
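One common option is to synthesize an <UNK> vector after training, e.g. as the mean of the vectors of the rarest words, so out-of-vocabulary tokens get a reasonable fallback at lookup time. A minimal sketch, assuming the word2vec file produced in step 4:

import numpy as np
from gensim.models import KeyedVectors

# Load the vectors trained in step 4.
wv = KeyedVectors.load_word2vec_format('path/to/word2vec', binary=False)

# One possible <UNK> vector: the mean of the 1000 least frequent words' vectors
# (gensim keeps index_to_key sorted by descending frequency).
rare_words = wv.index_to_key[-1000:]
unk_vector = np.mean([wv[w] for w in rare_words], axis=0)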