- Java
- IntelliJ IDEA 2017
- Using regular expressions to remove ignored characters and JSON markup (pattern -> replacement)
\\"|\\r|\\t|\\n -> empty string
{"type":".+","title":"(.+)","content":" -> $1 followed by one space ('$1 ')
","url":".+ -> empty string
- Using UETSegmenter for word segmentation
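
The three replacement rules above can be sketched in Java roughly as follows. This is a minimal illustration, not the project's actual preprocessing code: the class name `NewsCleaner` and the method `clean` are hypothetical, and the sketch assumes each news item is a single-line JSON record with the fields `type`, `title`, `content` and `url` in that order. The leading `{` is escaped here because `java.util.regex` rejects an unescaped `{` at that position.

```java
class NewsCleaner {
    // Rule 1: escaped quotes, carriage returns, tabs and newlines that appear
    // as literal backslash sequences in the raw text. Replaced with nothing.
    private static final String ESCAPES = "\\\\\"|\\\\r|\\\\t|\\\\n";

    // Rule 2: everything up to the start of the content value.
    // Group 1 captures the title, which is kept and followed by one space.
    private static final String HEAD =
            "\\{\"type\":\".+\",\"title\":\"(.+)\",\"content\":\"";

    // Rule 3: the url field and everything after it. Replaced with nothing.
    private static final String TAIL = "\",\"url\":\".+";

    // Applies the three rules in the order listed above (assumed order).
    static String clean(String rawLine) {
        return rawLine
                .replaceAll(ESCAPES, "")
                .replaceAll(HEAD, "$1 ")
                .replaceAll(TAIL, "");
    }

    public static void main(String[] args) {
        // Made-up sample record, for illustration only.
        String raw = "{\"type\":\"news\",\"title\":\"Hello\","
                + "\"content\":\"world\\n here\",\"url\":\"http://x\"}";
        System.out.println(clean(raw)); // prints "Hello world here"
    }
}
```

The result is the title and the content joined by a single space, which is the plain text that would then be passed to the word segmenter.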
NewsClassifySystem(root)
--- data
--- data/dictionary
--- data/pre1
--- data/pre2
--- data/test
--- dictionary
--- models
--- src
--- src/main/java/com/classify/crawler
--- src/main/java/com/classify/dictionary
--- src/main/java/com/classify/preprocess
--- uetsegmenter.jar
--- vnstopword.txt
- Set up the Java environment
- Clone or download source code
- Edit the config in IConfig.java (com.classify.dictionary): MAX_NUMBER_OF_NEW
- Run com.classify.dictionary.Runner to train the model
- Run com.classify.test.Checker to test the model
- The all_news list can have size < TOTAL_NEWS: some news items contain no useful words after splitting and are dropped
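
The size difference noted above could arise from a filtering step like the one sketched below. This is only an illustration under stated assumptions: the class `UsefulNewsFilter` and its methods are hypothetical, the input is assumed to be already segmented into whitespace-separated tokens, and "useful" is assumed to mean "not in the stopword list" (for example the words in vnstopword.txt). The project's real criteria may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class UsefulNewsFilter {
    // Returns the tokens of one segmented news item that are not stopwords.
    static List<String> usefulWords(String segmented, Set<String> stopwords) {
        List<String> kept = new ArrayList<>();
        for (String token : segmented.trim().split("\\s+")) {
            if (!token.isEmpty() && !stopwords.contains(token.toLowerCase())) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Keeps only the news items that still have at least one useful word.
    // The returned list can therefore be shorter than the input list.
    static List<List<String>> keepUseful(List<String> news, Set<String> stopwords) {
        List<List<String>> allNews = new ArrayList<>();
        for (String item : news) {
            List<String> words = usefulWords(item, stopwords);
            if (!words.isEmpty()) {
                allNews.add(words);
            }
        }
        return allNews;
    }
}
```

With such a filter, an item made up entirely of stopwords contributes nothing to the result, so the number of kept items falls below the number of items read.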