richardsun-voyager/ChineseLanguageProcessing

Segment Chinese texts and classify them using machine learning methods

Jupyter Notebook

ChineseTextClassification

This project addresses Chinese document classification. Unlike English, whose sentences consist of words separated by spaces and punctuation, Chinese text consists of characters written one after another without spaces, so it has to be segmented into terms before features can be extracted. The file encoding also needs some care. The workflow is:

Read the Chinese texts using the 'gbk' encoding
Segment the texts into terms with the Jieba package, which combines a prefix dictionary with an HMM for out-of-vocabulary words
Extract features by transforming the texts into vectors of TF-IDF values of the Chinese terms
Train a classifier and classify the texts based on the features extracted above
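
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebooks: the helper names (read_gbk, jieba_tokenize, build_classifier) are hypothetical, and multinomial naive Bayes is an assumed choice of classifier, since the project description only says "machine learning methods". It relies on jieba.lcut from Jieba and on TfidfVectorizer, MultinomialNB and make_pipeline from scikit-learn.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def read_gbk(path):
    """Step 1: read a text file stored in GBK encoding (hypothetical helper)."""
    # errors="ignore" skips bytes that are not valid GBK instead of raising.
    with open(path, encoding="gbk", errors="ignore") as f:
        return f.read()


def jieba_tokenize(text):
    """Step 2: segment Chinese text into terms with Jieba."""
    # Imported lazily so the other helpers can be used without Jieba installed.
    import jieba
    # Drop tokens that are only whitespace (spaces, newlines).
    return [tok for tok in jieba.lcut(text) if tok.strip()]


def build_classifier(tokenizer=jieba_tokenize):
    """Steps 3 and 4: TF-IDF features followed by a classifier.

    MultinomialNB is an assumption made for this sketch; any scikit-learn
    classifier that accepts sparse input could be used instead.
    """
    vectorizer = TfidfVectorizer(
        tokenizer=tokenizer,  # use the segmenter instead of the default regex
        token_pattern=None,   # unused when a tokenizer is given
        lowercase=False,      # case folding has no effect on Chinese terms
    )
    return make_pipeline(vectorizer, MultinomialNB())
```

Given a list of documents train_texts (each loaded with read_gbk) and a parallel list train_labels, calling build_classifier().fit(train_texts, train_labels) trains the model, and predict on new documents returns their labels. Passing the segmenter as the vectorizer's tokenizer keeps segmentation and feature extraction in one pipeline, so training and test texts are processed identically.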

Environment:

Python 3.5.2
Scikit-learn 0.18.2
NumPy
Jieba

Dataset:

Fudan Chinese Documents for classification
training data: http://download.csdn.net/detail/github_36326955/9747927
testing data: http://download.csdn.net/detail/github_36326955/9747929