wikisourceindex

Builds an index for Wikisource using MapReduce.

Primary Language: Java

The dom4j library is no longer used because it loads the whole document into memory, which is not feasible for a large file; SAX is used instead to transform the XML into plain text.
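To illustrate why SAX suits large dumps, here is a minimal sketch of a streaming XML-to-text conversion using the JDK's built-in SAX parser. It is not the project's actual code: the class name `SaxTextExtractor`, the `<page>`/`<title>`/`<text>` element names, and the one-line-per-article "title TAB text" output are all assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not taken from the repository.
class SaxTextExtractor {

    // Streams the XML and writes one "title<TAB>text" line per <text> element.
    // Only the element currently being read is buffered, never the whole file.
    static void extract(InputStream xml, Writer out) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(xml, new DefaultHandler() {
            private final StringBuilder buf = new StringBuilder();
            private String title = "";
            private boolean capture = false;

            @Override
            public void startElement(String uri, String local, String qName, Attributes attrs) {
                // Assumed element names; adjust to the real dump schema.
                if (qName.equals("title") || qName.equals("text")) {
                    buf.setLength(0);
                    capture = true;
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per element, so append.
                if (capture) {
                    buf.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) throws SAXException {
                if (qName.equals("title")) {
                    title = buf.toString().trim();
                    capture = false;
                } else if (qName.equals("text")) {
                    // Collapse whitespace so each article fits on one line.
                    String body = buf.toString().replaceAll("\\s+", " ").trim();
                    try {
                        out.write(title + "\t" + body + "\n");
                    } catch (IOException e) {
                        throw new SAXException(e);
                    }
                    capture = false;
                }
            }
        });
    }

    // Convenience wrapper for small inputs such as tests.
    static String toText(String xml) {
        StringWriter out = new StringWriter();
        try {
            extract(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), out);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String sample = "<mediawiki><page><title>A</title>"
                + "<revision><text>hello\n  world</text></revision></page></mediawiki>";
        System.out.print(toText(sample));
    }
}
```

For a real dump, `extract` would be given a `FileInputStream` and a buffered file writer, so memory use stays bounded by the size of the largest single article rather than the size of the file.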

PosProcessor: Calculates the positions of each word within an article, along with the total counts of words and articles.
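The kind of output PosProcessor describes can be sketched in plain Java, without the MapReduce plumbing. This is only an illustration under assumptions: the class `PositionSketch`, the lowercase ASCII tokenization on non-word characters, and the 0-based token offsets are hypothetical choices, not the project's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not taken from the repository.
class PositionSketch {

    // Splits an article into lowercase tokens (assumed tokenization).
    static List<String> tokenize(String article) {
        List<String> tokens = new ArrayList<>();
        for (String token : article.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) { // split can yield a leading empty token
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Maps each word to the 0-based token offsets where it occurs in one article.
    static Map<String, List<Integer>> positions(String article) {
        Map<String, List<Integer>> result = new TreeMap<>();
        List<String> tokens = tokenize(article);
        for (int pos = 0; pos < tokens.size(); pos++) {
            result.computeIfAbsent(tokens.get(pos), k -> new ArrayList<>()).add(pos);
        }
        return result;
    }

    // Total number of word occurrences across all articles.
    static int totalWords(List<String> articles) {
        int total = 0;
        for (String article : articles) {
            total += tokenize(article).size();
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> articles = Arrays.asList("to be or not to be", "to be");
        System.out.println(positions(articles.get(0)));
        System.out.println("words=" + totalWords(articles) + " articles=" + articles.size());
    }
}
```

In a MapReduce job, the per-article work in `positions` would typically sit in the mapper, while the global word and article totals would be aggregated in a reducer or via counters.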

DFandTFProcessor: Calculates the document frequency (df) and term frequency (tf) of each word.
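As a rough illustration of the two statistics, the sketch below computes them in plain Java. It assumes tf is the raw number of occurrences of a word in one article and df is the number of articles containing the word; the project may define or normalize them differently. The class `DfTfSketch` and its tokenization are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch, not taken from the repository.
class DfTfSketch {

    // tf: how many times each word occurs in a single article (raw count).
    static Map<String, Integer> termFrequency(String article) {
        Map<String, Integer> tf = new TreeMap<>();
        for (String token : article.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // df: in how many articles each word occurs at least once.
    static Map<String, Integer> documentFrequency(List<String> articles) {
        Map<String, Integer> df = new TreeMap<>();
        for (String article : articles) {
            // Use the distinct words of the article so repeats count once.
            Set<String> distinct = new HashSet<>(termFrequency(article).keySet());
            for (String word : distinct) {
                df.merge(word, 1, Integer::sum);
            }
        }
        return df;
    }

    public static void main(String[] args) {
        List<String> articles = Arrays.asList("a b a", "b c");
        System.out.println("tf(article 0) = " + termFrequency(articles.get(0)));
        System.out.println("df = " + documentFrequency(articles));
    }
}
```

Expressed as MapReduce, a mapper would emit one (word, article) pair per occurrence, and a reducer keyed on the word would count occurrences per article for tf and distinct articles for df.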