

Primary language: Java. License: Apache License 2.0.


A Hadoop program that analyzes an XML file containing a large corpus of Wikipedia pages and filters the pages that contain certain keywords (case-insensitive).

hadoop jar textfilter-0.0.1-SNAPSHOT.jar input output keyword1 keyword2 keyword3
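
The case-insensitive keyword check applied to each page could look roughly like the sketch below. It is not the project's actual code: the class name `KeywordFilter` and the methods `normalize` and `matchesAny` are hypothetical, and it assumes a page is kept when it contains any one of the keywords (the project may instead require all of them). In a Hadoop job, a mapper would typically call such a check on the text of each page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the project: illustrates case-insensitive filtering.
public class KeywordFilter {

    // Lower-case the keywords once so each page needs only one normalization pass.
    static List<String> normalize(String[] keywords) {
        List<String> out = new ArrayList<>();
        for (String k : keywords) {
            out.add(k.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Assumption: a page matches when it contains ANY of the keywords.
    static boolean matchesAny(String pageText, List<String> lowerKeywords) {
        String haystack = pageText.toLowerCase(Locale.ROOT);
        for (String k : lowerKeywords) {
            if (haystack.contains(k)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> keywords = normalize(new String[] {"Hadoop", "mapreduce"});
        // Matches despite the different letter case in the page text.
        System.out.println(matchesAny("<page><title>Apache HADOOP</title></page>", keywords));
        // No keyword occurs in this page.
        System.out.println(matchesAny("<page><title>Cooking</title></page>", keywords));
    }
}
```

Lower-casing with `Locale.ROOT` keeps the comparison independent of the machine's default locale, which matters when the job runs on many cluster nodes.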