madlibport: A C++ repository from Cloudera

MADlib Port

This project brings in-DBMS data analytics to Impala. It builds on previous work done by two projects:

Each of these projects uses User Defined Aggregates (UDAs) to train analytic models, relying on an existing DBMS's data management and processing abilities.
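To make the UDA idea concrete, here is a minimal sketch in plain Python of how a model can be trained through an aggregate's init/update/merge/finalize phases. It is an illustration only, not this repository's code: the names (SvmState, uda_init, uda_update, uda_merge, uda_finalize, train_one_epoch), the step size, the regularization constant, and the choice of weighted model averaging in the merge phase are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Row = Tuple[Sequence[float], int]  # (features, label), label is +1 or -1


@dataclass
class SvmState:
    """Intermediate aggregate state (hypothetical): a linear model plus a row count."""
    weights: List[float]
    count: int = 0


def uda_init(num_features: int) -> SvmState:
    # Init phase: start each partition from a zero model.
    return SvmState(weights=[0.0] * num_features)


def uda_update(state: SvmState, features: Sequence[float], label: int,
               step: float = 0.1, reg: float = 0.01) -> SvmState:
    # Update phase: one stochastic subgradient step on the hinge loss per row.
    margin = label * sum(w * x for w, x in zip(state.weights, features))
    # L2 regularization shrinks the weights on every row.
    state.weights = [w * (1.0 - step * reg) for w in state.weights]
    if margin < 1.0:
        # The row violates the margin, so move the model toward it.
        state.weights = [w + step * label * x
                         for w, x in zip(state.weights, features)]
    state.count += 1
    return state


def uda_merge(a: SvmState, b: SvmState) -> SvmState:
    # Merge phase: combine two partial models by count-weighted averaging
    # (an assumption of this sketch; other merge rules are possible).
    total = a.count + b.count
    if total == 0:
        return SvmState(weights=list(a.weights), count=0)
    weights = [(wa * a.count + wb * b.count) / total
               for wa, wb in zip(a.weights, b.weights)]
    return SvmState(weights=weights, count=total)


def uda_finalize(state: SvmState) -> List[float]:
    # Finalize phase: return the model as the aggregate's result.
    return state.weights


def train_one_epoch(partitions: Sequence[Sequence[Row]],
                    num_features: int) -> List[float]:
    # Simulates what the DBMS does: aggregate each partition, then merge.
    merged = uda_init(num_features)
    for rows in partitions:
        local = uda_init(num_features)
        for features, label in rows:
            local = uda_update(local, features, label)
        merged = uda_merge(merged, local)
    return uda_finalize(merged)
```

In a real DBMS the engine, not a Python loop, decides how rows are partitioned and when partial states are merged; the sketch only mirrors that control flow so the phases are easy to follow.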

make
python python/deploy.py <relevant options>

This is a fork of MADlib 1.0 that has been modified for use with Impala. The specific changes were:

To run the example SVM:

Create the database toysvm.
Run make. If it fails with an error like lib/libsvm.so: undefined reference to 'impala_udf::FunctionContext::Allocate(int)', the error can be ignored.
To register the UDFs with a database (without re-making the binaries), execute: python python/deploy.py -p -o /path/for/libs toysvm
Create a synthetic table of examples named toy in the database toysvm: python python/gen_classify_data.py toysvm toy
Run the SVM example on that table: python python/impala_svm.py lbl e0 e1 e2 --db toysvm --table toy -e 1
Print the model recorded at each iteration: impala-shell -q 'use toysvm; select iter, printarray(decodearray(model)) from history;'

Also see example usage in the impyla repo.