Train character vector representations for a Chinese corpus.
Use data set: zhwiki-20170220-pages-articles1.xml.bz2
Run WikiExtractor.py to extract the dump into the files wiki_00 ~ wiki_07.
Concatenate these files into one:

```bash
cat wiki_* > processed_zhwiki.txt
```
Perform Traditional Chinese to Simplified Chinese conversion using OpenCC. A common conversion command:

```bash
/usr/local/Cellar/opencc/1.0.4/bin/opencc -i processed_zhwiki.txt -o transformed_zh_wiki -c /usr/local/Cellar/opencc/1.0.4/share/opencc/t2s.json
```
Delete the empty brackets that WikiExtractor.py leaves behind, as sketched below.
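A minimal cleanup sketch. The filenames and the exact set of bracket pairs are assumptions; adjust them to whatever WikiExtractor actually left in your dump:

```python
import re

# Empty bracket pairs (full-width and ASCII) that can remain after
# WikiExtractor strips templates and links, e.g. "（）" or "()".
# The set of pairs here is an assumption; extend it as needed.
EMPTY_BRACKETS = re.compile(r'（\s*）|\(\s*\)|「\s*」|《\s*》|\[\s*\]')

# Hypothetical filenames matching the steps above.
with open('transformed_zh_wiki', encoding='utf-8') as fin, \
     open('cleaned_zh_wiki.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(EMPTY_BRACKETS.sub('', line))
```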
Run Tokenization.py to segment the text with Jieba (a minimal sketch of this step follows the table below).
Common methods of Chinese segmentation:

| Method | Algorithm | Related Links |
| --- | --- | --- |
| Jieba | Based on a prefix dictionary for efficient word-graph scanning: builds a directed acyclic graph (DAG) of all possible word combinations, uses dynamic programming to find the most probable combination based on word frequency, and handles unknown words with an HMM-based model and the Viterbi algorithm. | GitHub |
| THULAC (THU Lexical Analyzer for Chinese) | Based on a structured perceptron | GitHub, paper (2009) |
| Stanford Segmenter | Based on CRF | GitHub, tutorials, paper (2005), paper (2008) |
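A minimal sketch of the Jieba segmentation step. The input/output filenames are assumptions, and Tokenization.py may differ in detail:

```python
import jieba

# Segment each line into space-separated tokens so that
# gensim's LineSentence can consume the output directly.
with open('cleaned_zh_wiki.txt', encoding='utf-8') as fin, \
     open('segmented_zh_wiki.txt', 'w', encoding='utf-8') as fout:
    for line in fin:
        tokens = jieba.cut(line.strip())
        fout.write(' '.join(tokens) + '\n')
```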
Run Word2Vec_train.py to train character vectors for the Chinese corpus (a gensim sketch using these settings follows the parameter list).
Parameter settings:
- sg = 1 # use skip-gram
- hs = 0, negative = 5 # use negative sampling (5 noise words) instead of hierarchical softmax
- size = 100 # dimensionality of the feature vectors
- alpha = 0.025 # initial learning rate
- window = 5 # context window size
- min_count = 5 # ignore all words with total frequency lower than 5
- sample = 0.001 # threshold for downsampling higher-frequency words; default is 1e-3, useful range is (0, 1e-5)
- batch_words = 10000 # target size for batches of examples passed to worker threads
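A minimal training sketch with the parameters above. The filenames are assumptions, and gensim 3.x is assumed (in gensim 4.x, `size` was renamed to `vector_size`):

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# LineSentence streams one whitespace-tokenized sentence per line.
sentences = LineSentence('segmented_zh_wiki.txt')

model = Word2Vec(
    sentences,
    sg=1,             # skip-gram
    hs=0,             # no hierarchical softmax ...
    negative=5,       # ... use negative sampling with 5 noise words
    size=100,         # dimensionality of the vectors (vector_size in gensim 4.x)
    alpha=0.025,      # initial learning rate
    window=5,         # context window size
    min_count=5,      # drop words with total frequency < 5
    sample=0.001,     # downsampling threshold for frequent words
    batch_words=10000 # batch size passed to worker threads
)

model.save('zhwiki_word2vec.model')
```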
See the API documentation of Word2Vec in gensim: https://radimrehurek.com/gensim/models/word2vec.html
You can review the results here.
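To inspect the trained vectors, something like the following should work (the model filename and the query token are only illustrations):

```python
from gensim.models import Word2Vec

model = Word2Vec.load('zhwiki_word2vec.model')

# Nearest neighbours of a token in the embedding space;
# '北京' is just an example query.
for token, similarity in model.wv.most_similar('北京', topn=10):
    print(token, similarity)
```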