This project brings in-DBMS data analytics to Impala. This leverages previous work done by two projects:
- MADlib (http://madlib.net/)

Each of these projects uses User Defined Aggregates (UDAs) to train analytic models, relying on the existing DBMS for data management and processing.
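
As a rough illustration of the UDA idea, here is a minimal pure-Python sketch of training a linear SVM through the usual init / update / merge / finalize phases of an aggregate. It is not this project's code: the function names (`svm_init`, `svm_update`, `svm_merge`, `svm_finalize`, `train`) are hypothetical, and merging partial models by weighted averaging is an assumption made for the example.

```python
import random


def svm_init(num_features, model=None):
    """Init phase: start from a previous model, or from zero weights."""
    weights = list(model) if model is not None else [0.0] * num_features
    return {"w": weights, "n": 0}


def svm_update(state, features, label, step=0.1, reg=0.01):
    """Update phase: fold one row into the state with a hinge-loss SGD step."""
    w = state["w"]
    margin = label * sum(wi * xi for wi, xi in zip(w, features))
    for i, xi in enumerate(features):
        # Subgradient of (reg / 2) * |w|^2 + max(0, 1 - label * w.x)
        grad = reg * w[i] - (label * xi if margin < 1.0 else 0.0)
        w[i] -= step * grad
    state["n"] += 1
    return state


def svm_merge(a, b):
    """Merge phase (assumed): average partial models, weighted by rows seen."""
    total = a["n"] + b["n"]
    if total == 0:
        return {"w": list(a["w"]), "n": 0}
    w = [(a["n"] * wa + b["n"] * wb) / total for wa, wb in zip(a["w"], b["w"])]
    return {"w": w, "n": total}


def svm_finalize(state):
    """Finalize phase: the aggregate's result is the weight vector."""
    return state["w"]


def train(partitions, num_features, epochs=5):
    """Run one aggregate per epoch, seeding each with the previous model.

    Assumes at least one partition; each partition is a list of
    (features, label) rows with label in {-1, +1}.
    """
    model = None
    for _ in range(epochs):
        partials = []
        for rows in partitions:
            state = svm_init(num_features, model)
            for features, label in rows:
                svm_update(state, features, label)
            partials.append(state)
        merged = partials[0]
        for other in partials[1:]:
            merged = svm_merge(merged, other)
        model = svm_finalize(merged)
    return model


if __name__ == "__main__":
    random.seed(0)
    rows = []
    while len(rows) < 200:
        x = [random.uniform(-1, 1), random.uniform(-1, 1)]
        if abs(x[0] + x[1]) > 0.2:  # keep a margin around the boundary
            rows.append((x, 1 if x[0] + x[1] > 0 else -1))
    print(train([rows[:100], rows[100:]], num_features=2))
```

In a DBMS the engine would call the update step once per row on each node and the merge step across nodes, so the training loop itself never has to move the data out of the database.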

Requirements:

- `eigen3` (`yum install eigen3-devel`)
- `boost>=1.54.0` (manual install on CentOS 6)
- `impala-udf-devel` (yum-installable with Cloudera repo)

To build and deploy:

    make
    python python/deploy.py <relevant options>

This is a fork of MADlib 1.0 that has been modified for use with Impala. The specific changes were:

- `madlib/test` with tests for the new code
- `madlib/Makefile` to make the tests
- `madlib/src/ports/metaport`, which is a modified MADlib backend for main memory

To run the example SVM:

- Create the database `toysvm`.
- Run `make`. If it fails with an error like `lib/libsvm.so: undefined reference to 'impala_udf::FunctionContext::Allocate(int)'`, it's ok.
- To register the UDFs with the database (without re-making the binaries), execute: `python python/deploy.py -p -o /path/for/libs toysvm`
- Create a synthetic table of examples named `toy` in the database `toysvm`: `python python/gen_classify_data.py toysvm toy`
- Run the SVM: `python python/impala_svm.py lbl e0 e1 e2 --db toysvm --table toy -e 1`
- Print the model recorded at each iteration: `impala-shell -q 'use toysvm; select iter, printarray(decodearray(model)) from history;'`

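
The steps above can also be scripted. The following is a minimal Python 3 sketch, not part of this project: `svm_example_commands` and `run_steps` are hypothetical helpers, the `create database` statement passed to `impala-shell -q` is an assumption (the steps above only say to create the database), and the `make` step is marked as allowed to fail for any reason, which is broader than the one linker error described above. By default it only prints the commands.

```python
import shlex
import subprocess


def svm_example_commands(db="toysvm", table="toy", lib_path="/path/for/libs"):
    """Return the example's steps as (argv, may_fail) pairs, in order."""
    query = "use {0}; select iter, printarray(decodearray(model)) from history;"
    return [
        # Assumed way of creating the database; adjust to your setup.
        (["impala-shell", "-q", "create database {0}".format(db)], False),
        # The build may fail on the undefined-reference error; tolerated here.
        (["make"], True),
        (["python", "python/deploy.py", "-p", "-o", lib_path, db], False),
        (["python", "python/gen_classify_data.py", db, table], False),
        (["python", "python/impala_svm.py", "lbl", "e0", "e1", "e2",
          "--db", db, "--table", table, "-e", "1"], False),
        (["impala-shell", "-q", query.format(db)], False),
    ]


def run_steps(steps, dry_run=True):
    """Print each command; execute it too unless dry_run is set."""
    lines = []
    for argv, may_fail in steps:
        line = " ".join(shlex.quote(arg) for arg in argv)
        print(line)
        lines.append(line)
        if dry_run:
            continue
        code = subprocess.call(argv)
        if code != 0 and not may_fail:
            raise RuntimeError("step failed: " + line)
    return lines


if __name__ == "__main__":
    run_steps(svm_example_commands(), dry_run=True)
```

Calling `run_steps(svm_example_commands(), dry_run=False)` from the repository root would run the commands for real, stopping at the first failure other than `make`.
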
Also see example usage in the impyla repo.