Primary language: Java. License: Apache License 2.0.

HadoopTextCategorization

Distributed Text Categorization

Description

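This project predicts the category of a piece of text from labeled training articles. It featurizes the articles with a word-count MapReduce job and classifies new text with a k-nearest-neighbors (KNN) MapReduce job, so accuracy depends on having sufficient training data.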

This project was created for ECE465 Cloud Computing at Cooper Union, taught by Professor Rob Marano in the Spring 2014 term.

Dependencies

  • Java 7
  • Hadoop 2.x.x
  • Maven 2

Usage

General Syntax

java Main -f|--featurize training-file-dir output-dir
java Main -t|--trained trained-file-dir testing-dir output-dir labels-file
java Main training-file-dir testing-dir output-dir labels-file

To run, set the appropriate arguments in the Maven pom.xml file, then run mvn compile exec:exec. Examples of possible arguments are included in pom.xml, commented out.
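The three argument forms in the syntax above can be told apart by the first argument and the argument count. The following is a minimal sketch of that dispatch, not the project's actual Main class: the class name ArgsSketch, the Mode enum, and the error messages are assumptions made for illustration. Only the flags and argument counts come from this README.

```java
class ArgsSketch {
    // Hypothetical names for the three run modes described in this README.
    enum Mode { FEATURIZE_ONLY, KNN_ONLY, FULL_PIPELINE }

    // Decide which mode was requested from the first argument and the count.
    static Mode parseMode(String[] args) {
        if (args.length == 0) {
            throw new IllegalArgumentException("no arguments given");
        }
        String first = args[0];
        if (first.equals("-f") || first.equals("--featurize")) {
            // flag + training-file-dir + output-dir
            requireCount(args, 3);
            return Mode.FEATURIZE_ONLY;
        }
        if (first.equals("-t") || first.equals("--trained")) {
            // flag + trained-file-dir + testing-dir + output-dir + labels-file
            requireCount(args, 5);
            return Mode.KNN_ONLY;
        }
        // training-file-dir + testing-dir + output-dir + labels-file
        requireCount(args, 4);
        return Mode.FULL_PIPELINE;
    }

    static void requireCount(String[] args, int expected) {
        if (args.length != expected) {
            throw new IllegalArgumentException(
                "expected " + expected + " arguments, got " + args.length);
        }
    }

    public static void main(String[] args) {
        // Placeholder paths, for illustration only.
        System.out.println(parseMode(new String[] {"-f", "train-dir", "out-dir"}));
    }
}
```

Rejecting a wrong argument count up front keeps a mistyped pom.xml argument list from starting a Hadoop job with the wrong paths.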

Options

-f|--featurize Run only the word-count MapReduce job on the training articles. Takes the HDFS directory of training articles and an HDFS directory for the featurized output.
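To illustrate what word-count featurizing produces, here is a minimal in-memory sketch in plain Java. It is not the project's Hadoop job: the class name WordCountSketch and the tokenization rule (lowercase, split on non-alphanumeric characters) are assumptions. The two methods mirror the map step (emit tokens) and the reduce step (sum per token) of a word-count job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class WordCountSketch {
    // "Map" step: emit one lowercase token per word in the article text.
    // The tokenization rule is an assumption made for this sketch.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // "Reduce" step: sum the occurrences of each token.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : tokenize(text)) {
            Integer current = counts.get(token);
            counts.put(token, current == null ? 1 : current + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {cat=2, other=1, saw=1, the=2}
        System.out.println(countWords("The cat saw the other cat."));
    }
}
```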

-t|--trained Run the KNN algorithm on already-featurized training data. Takes the HDFS directory of featurized training data, a local directory of testing articles, an HDFS output directory, and a local file of testing-article labels.

Without any flags, both MapReduce jobs (featurizing and KNN) are run. Takes the HDFS directory of training data, a local directory of testing articles, an HDFS output directory, and a local file of testing-article labels.
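The KNN step can be pictured with a small in-memory sketch over word-count features. This README does not state the distance metric, the value of k, or the tie-breaking rule, so the choices below (cosine similarity, a caller-supplied k, ties going to the nearer neighbor) are assumptions, as are the names KnnSketch, Example, and counts. The project itself runs this step as a MapReduce job rather than in a single process.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class KnnSketch {
    // One labeled training article: its category and word-count features.
    static class Example {
        final String label;
        final Map<String, Integer> counts;
        Example(String label, Map<String, Integer> counts) {
            this.label = label;
            this.counts = counts;
        }
    }

    // Helper for the demo: build word counts from a list of words.
    static Map<String, Integer> counts(String... words) {
        Map<String, Integer> m = new HashMap<>();
        for (String w : words) {
            Integer v = m.get(w);
            m.put(w, v == null ? 1 : v + 1);
        }
        return m;
    }

    // Cosine similarity between two sparse count vectors (assumed metric).
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    // Majority vote among the k most similar training examples.
    static String classify(final Map<String, Integer> query, List<Example> training, int k) {
        List<Example> sorted = new ArrayList<>(training);
        Collections.sort(sorted, new Comparator<Example>() {
            @Override
            public int compare(Example x, Example y) {
                // Descending similarity: most similar examples first.
                return Double.compare(cosine(y.counts, query), cosine(x.counts, query));
            }
        });
        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        for (Example e : sorted.subList(0, Math.min(k, sorted.size()))) {
            Integer v = votes.get(e.label);
            votes.put(e.label, v == null ? 1 : v + 1);
            // Strict '>' means a tie keeps the label seen from nearer neighbors.
            if (best == null || votes.get(e.label) > votes.get(best)) {
                best = e.label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Example> training = new ArrayList<>();
        training.add(new Example("sports", counts("goal", "team", "match")));
        training.add(new Example("sports", counts("team", "score", "coach")));
        training.add(new Example("finance", counts("stock", "market", "bank")));
        // prints sports
        System.out.println(classify(counts("team", "goal", "score"), training, 2));
    }
}
```

Cosine similarity is used here because it ignores article length, which raw word counts are sensitive to. A different metric would only change the cosine method.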