Train character vector representations for a Chinese corpus.
Use data set: zhwiki-20170220-pages-articles1.xml.bz2
Run WikiExtractor.py to extract the dump into the files wiki_00 ~ wiki_07.
Concatenate these files into one:

```bash
cat wiki_* > processed_zhwiki.txt
```
Perform Traditional Chinese to Simplified Chinese conversion using OpenCC. A common conversion command:

```bash
/usr/local/Cellar/opencc/1.0.4/bin/opencc -i processed_zhwiki.txt -o transformed_zh_wiki -c /usr/local/Cellar/opencc/1.0.4/share/opencc/t2s.json
```
Delete the empty brackets that WikiExtractor.py leaves behind, as sketched below.
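A minimal cleanup sketch. The filenames and the exact set of bracket pairs are assumptions; adjust them to whatever WikiExtractor actually left in your dump:

```python
import re

# Empty bracket pairs (full-width and ASCII) that can remain after
# WikiExtractor strips templates and links, e.g. "（）" or "()".
# The set of pairs here is an assumption; extend it as needed.
EMPTY_BRACKETS = re.compile(r'（\s*）|\(\s*\)|「\s*」|《\s*》|\[\s*\]')

# Hypothetical filenames matching the steps above.
with open('transformed_zh_wiki', encoding='utf-8') as fin, \
     open('cleaned_zh_wiki.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(EMPTY_BRACKETS.sub('', line))
```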
Run Tokenization.py to segment the text with Jieba (a minimal sketch of this step follows the table below).
Common methods of Chinese segmentation:

| Method | Algorithm | Related Links |
| --- | --- | --- |
| Jieba | Based on a prefix dictionary for efficient word-graph scanning: builds a directed acyclic graph (DAG) of all possible word combinations, uses dynamic programming to find the most probable combination based on word frequency, and handles unknown words with an HMM-based model and the Viterbi algorithm. | GitHub |
| THULAC (THU Lexical Analyzer for Chinese) | Based on a structured perceptron | GitHub, paper (2009) |
| Stanford Segmenter | Based on CRF | GitHub, tutorials, paper (2005), paper (2008) |
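A minimal sketch of the Jieba segmentation step. The input/output filenames are assumptions, and Tokenization.py may differ in detail:

```python
import jieba

# Segment each line into space-separated tokens so that
# gensim's LineSentence can consume the output directly.
with open('cleaned_zh_wiki.txt', encoding='utf-8') as fin, \
     open('segmented_zh_wiki.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        tokens = jieba.cut(line.strip())
        fout.write(' '.join(tokens) + '\n')
```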
Run Word2Vec_train.py to train character vectors for the Chinese corpus (a gensim sketch using these settings follows the parameter list).
Parameter settings:
- sg = 1 # use skip-gram
- hs = 0, negative = 5 # use negative sampling (5 noise words) instead of hierarchical softmax
- size = 100 # dimensionality of the feature vectors
- alpha = 0.025 # initial learning rate
- window = 5 # context window size
- min_count = 5 # ignore all words with total frequency lower than 5
- sample = 0.001 # threshold for downsampling higher-frequency words; default is 1e-3, useful range is (0, 1e-5)
- batch_words = 10000 # target size for batches of examples passed to worker threads
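A minimal training sketch with the parameters above. The filenames are assumptions, and gensim 3.x is assumed (in gensim 4.x, `size` was renamed to `vector_size`):

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# LineSentence streams one whitespace-tokenized sentence per line.
sentences = LineSentence('segmented_zh_wiki.txt')

model = Word2Vec(
    sentences,
    sg=1,             # skip-gram
    hs=0,             # no hierarchical softmax ...
    negative=5,       # ... use negative sampling with 5 noise words
    size=100,         # dimensionality of the vectors (vector_size in gensim 4.x)
    alpha=0.025,      # initial learning rate
    window=5,         # context window size
    min_count=5,      # drop words with total frequency < 5
    sample=0.001,     # downsampling threshold for frequent words
    batch_words=10000 # batch size passed to worker threads
)

model.save('zhwiki_word2vec.model')
```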
See the API documentation of Word2Vec in gensim: https://radimrehurek.com/gensim/models/word2vec.html
You can review the results here.
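To inspect the trained vectors, something like the following should work (the model filename and the query token are only illustrations):

```python
from gensim.models import Word2Vec

model = Word2Vec.load('zhwiki_word2vec.model')

# Nearest neighbours of a token in the embedding space;
# '北京' is just an example query.
for token, similarity in model.wv.most_similar('北京', topn=10):
    print(token, similarity)
```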