
Machine learning and natural language processing with Apache Pig

Primary language: Java. License: Apache License 2.0.

Varaha

A set of Apache Pig scripts and UDFs (User Defined Functions) for machine learning and natural language processing. Why should Mahout have all the fun?

Build

You’ll want to build the UDFs before doing anything else. To build them, run:


mvn clean package
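
After the build finishes, it can help to confirm that the jar the Pig scripts register actually exists. The sketch below assumes the build writes the jar to target/varaha-1.0-SNAPSHOT.jar, the path used by the register statements in this README; the function name check_udf_jar is made up for illustration, and the version in the file name may differ in your checkout.

```shell
# Hypothetical helper: report whether the UDF jar exists at the given path.
# The default path is an assumption taken from the register statements in
# this README; pass a different path if your build produces another name.
check_udf_jar() {
    jar="${1:-target/varaha-1.0-SNAPSHOT.jar}"
    if [ -f "$jar" ]; then
        echo "found: $jar"
    else
        echo "missing: $jar (run 'mvn clean package' first)"
        return 1
    fi
}
```

For example, running `mvn clean package && check_udf_jar` from the repository root builds the project and then checks the default path.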

The rest

See the individual readme files under the scripts directory for instructions on running each script.
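
To locate those readme files quickly, something like the following should work from the repository root. This is only a sketch: the function name list_script_readmes is hypothetical, and it assumes the readme file names start with "readme" in any letter case.

```shell
# Hypothetical helper: list readme files under a scripts directory.
# Defaults to ./scripts; matches file names beginning with "readme",
# ignoring case (-iname is supported by GNU and BSD find).
list_script_readmes() {
    dir="${1:-scripts}"
    find "$dir" -type f -iname 'readme*' | sort
}
```

Calling `list_script_readmes` with no argument searches ./scripts; pass another directory as the first argument to search elsewhere.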

Why is it called Varaha?

Evidently, Varaha is an avatar of the Hindu god Vishnu in the form of a boar.

How do I tokenize and tag text?

-- Register the Stanford POS tagger jar and the Varaha UDF jar
register ../../lib/stanford-postagger-withModel.jar;
register ../../target/varaha-1.0-SNAPSHOT.jar;

-- Tokenize the text of each review
reviews = LOAD 'data/ten.avro' USING AvroStorage();
foo = FOREACH reviews GENERATE business_id, varaha.text.StanfordTokenize(text) AS tokenized;
DUMP foo;

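-- Tokenize first, then tag the tokens in a second step
reviews = LOAD 'data/ten.avro' USING AvroStorage();
reviews = LIMIT reviews 1000;
-- The FLATTEN argument is missing in the source; StanfordTokenize(text) is assumed here
bar = FOREACH reviews GENERATE business_id, FLATTEN(varaha.text.StanfordTokenize(text)) AS tokenized_sentences;
bar = FOREACH bar GENERATE business_id, varaha.text.StanfordPOSTag(tokenized_sentences) AS tagged;
DUMP bar;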

-- Tokenize and tag in a single step
reviews = LOAD 'data/ten.avro' USING AvroStorage();
reviews = LIMIT reviews 1000;
bar = FOREACH reviews GENERATE business_id, varaha.text.StanfordPOSTag(varaha.text.StanfordTokenize(text)) AS tagged;
DUMP bar;