A set of Apache Pig scripts and UDFs (User Defined Functions) for machine learning and natural language processing. Why should Mahout have all the fun?
You’ll want to build the UDFs before doing anything else. To do so, run:
mvn clean package
See the individual README files under the scripts directory for instructions on running each script.
Evidently, Varaha is an avatar of the Hindu god Vishnu, in the form of a boar.
-- Example: tokenize the text of each review
register ../../lib/stanford-postagger-withModel.jar
register ../../target/varaha-1.0-SNAPSHOT.jar
reviews = LOAD 'data/ten.avro' USING AvroStorage();
foo = FOREACH reviews GENERATE business_id, varaha.text.StanfordTokenize(text) AS tokenized;
DUMP foo;
-- Example: tokenize, flatten the result, then POS-tag each row
reviews = LOAD 'data/ten.avro' USING AvroStorage();
reviews = LIMIT reviews 1000;
bar = FOREACH reviews GENERATE business_id, FLATTEN(varaha.text.StanfordTokenize(text)) AS tokenized_sentences;
bar = FOREACH bar GENERATE business_id, varaha.text.StanfordPOSTag(tokenized_sentences) AS tagged;
DUMP bar;
-- Example: tokenize and POS-tag in one step by nesting the UDF calls
reviews = LOAD 'data/ten.avro' USING AvroStorage();
reviews = LIMIT reviews 1000;
bar = FOREACH reviews GENERATE business_id, varaha.text.StanfordPOSTag(varaha.text.StanfordTokenize(text)) AS tagged;
DUMP bar;
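DUMP is handy for a quick look, but with more than a handful of records it may be more practical to inspect the schema and write the results to disk. A minimal sketch, assuming the register statements and the `bar` relation from the example above are already in place, and using a hypothetical output directory `output/tagged`:

```pig
-- Print the schema Pig infers for the tagged relation
DESCRIBE bar;

-- Write the results to disk instead of printing them to the console.
-- 'output/tagged' is a hypothetical path; the directory must not already exist.
STORE bar INTO 'output/tagged' USING PigStorage('\t');
```

PigStorage writes nested values (bags and tuples) in Pig's text notation, so this should work whatever shape the UDFs return. A different storage function may be a better fit if the output feeds another tool.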