/HadoopScalableAnalytics

Primary language: Java

The open source tools in this repository are designed for scalable analytics on a variety of data types using the Hadoop and MapReduce environments. To use these tools, first download and install the Apache Hadoop software library and its modules (available at http://hadoop.apache.org/).

The source code is divided into two directories. The first directory, “Structured and Semi-Structured Data Biomedical and EHR”, contains code for scalable analytics on structured and semi-structured data. Use cases addressed in this version of the tools include biomedical data classification and classification of natural-language text documents.
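For readers new to the MapReduce model, the following is a minimal sketch of the kind of computation a text classification job distributes: counting how often each token occurs under each class label, which is the raw statistic many text classifiers are trained on. It uses only the Java standard library rather than the Hadoop API, and it is not taken from this repository. The class name LabelTokenCount, the tab-separated key format, and the sample documents are all assumptions made for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a single-process imitation of a map phase and a reduce phase.
// A real Hadoop job would implement Mapper and Reducer classes instead.
public class LabelTokenCount {

    // Map phase: for one labeled document, emit a ("label<TAB>token", 1) pair per token.
    public static List<Map.Entry<String, Integer>> map(String label, String text) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                pairs.add(new SimpleEntry<>(label + "\t" + token, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: sum the counts emitted for each distinct key.
    public static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample documents: {class label, document text}.
        String[][] docs = {
            {"cardiology", "Chest pain and elevated troponin"},
            {"cardiology", "Chest pain on exertion"},
            {"neurology", "Headache and blurred vision"}
        };
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String[] doc : docs) {
            emitted.addAll(map(doc[0], doc[1]));
        }
        // Print one "label<TAB>token<TAB>count" line per key.
        reduce(emitted).forEach((key, count) -> System.out.println(key + "\t" + count));
    }
}
```

On a cluster, the map calls would run in parallel on separate blocks of the input and the framework would group the emitted pairs by key before the reduce step, which is what lets this pattern scale to large document collections.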

The second directory, “Unstructured Image and Video Data Processing”, contains source code for scalable processing of complex image and video datasets on a Hadoop MapReduce framework. Use cases addressed in this version of the tools include scalable detection of humans and background-foreground segmentation in massive video collections, as well as face detection and facial feature detection in large image databases. Our custom unstructured data analytics solution for Hadoop MapReduce relies on a few other tools, including OpenCV, JavaCV, Xuggler, and FFmpeg. Instructions for installing these tools are provided in the documents “Tools Installation.doc” and “OpenCV-installation-instructions.docx” in this subdirectory.