Heterogeneous Ensemble Machine Learning Open Classification Kit (HEMLOCK)
HEMLOCK is a software tool for constructing, evaluating, and applying heterogeneous ensemble data models for use in solving supervised machine learning problems. Specifically, the main class of problems targeted by HEMLOCK is the problem of multiple-class classification (also called labeling or categorization) of data with continuous or discrete features. HEMLOCK consists of various data readers, machine learning algorithms, model combination and comparison routines, evaluation methods for model performance testing, and interfaces to external, state-of-the-art machine learning software libraries.
Compiling
An Ant build file, build.xml, is provided for compiling Hemlock. Provided
Apache Ant is installed on the machine, running the ant command from the
HEMLOCK directory should build the entire project.
In order to interface with Weka, weka.jar must be in the class path or in the
HEMLOCK/tpl directory at the time of compilation. You will get a warning
message if weka.jar is not in either of those locations when running the Ant
build file.
Installation
The project must be built before use because the executables are not distributed. See the section titled "Compiling" for more information.
In order to interface with Weka, weka.jar must be in the class path or in the
HEMLOCK/tpl directory while running Hemlock. If it is not, then any
experiments that request the use of Weka will not be executed and an error
message will be displayed.
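The check described above can be sketched in a few lines of Python, which may be handy before launching a long batch of experiments. This is only an illustration: the function name find_weka_jar is hypothetical, and it assumes the class path is taken from the CLASSPATH environment variable, which may differ from how Hemlock is actually launched on your system.

```python
import os
from pathlib import Path


def find_weka_jar(hemlock_dir, classpath=None):
    """Return the path of the first weka.jar found, or None if there is none.

    Assumption: the class path comes from the CLASSPATH environment variable
    unless one is passed in explicitly.
    """
    if classpath is None:
        classpath = os.environ.get("CLASSPATH", "")

    for entry in classpath.split(os.pathsep):
        if not entry:
            continue
        # A Java class path entry names a jar directly or uses a "dir/*" wildcard.
        if entry.endswith("*"):
            candidate = Path(entry[:-1]) / "weka.jar"
        else:
            candidate = Path(entry)
        if candidate.name == "weka.jar" and candidate.is_file():
            return candidate

    # Fall back to the HEMLOCK/tpl directory.
    tpl_jar = Path(hemlock_dir) / "tpl" / "weka.jar"
    return tpl_jar if tpl_jar.is_file() else None
```

A result of None suggests that experiments requesting Weka would be skipped with an error message.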
Adding Data Sets
Hemlock can import only one data set format, a modification of the C4.5 file
format. Each data set must have a *.name file and a *.data file. The first
line of the *.name file is a space-separated list of class labels. This is
followed by an empty line and then one line per attribute, where each line
contains continuous for a continuous attribute, or discrete followed by a
space-separated list of possible values for a discrete attribute. The *.name
file and the *.data file must share the same base name and be placed in a
folder with that same name. All such data directories should be put in a data
repository directory, which is simply a directory that contains only data
directories laid out as just described. An entry pointing to the data
repository directory you have created must be added to the HEMLOCK/.config
file. By default the HEMLOCK/data directory is already set up as a data
repository, so data directories can be dropped into that location immediately
for use by Hemlock.
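As a rough illustration of the layout described above, the following Python sketch renders a *.name file and places it in a data directory of the same name. The function names and the example data set ("weather", its labels and values) are hypothetical. The *.data file is not generated because its layout is not described here.

```python
from pathlib import Path


def render_name_file(class_labels, attributes):
    """Build the text of a *.name file.

    Each item in attributes is either the string "continuous" or a list
    holding the possible values of a discrete attribute.
    """
    # First line: class labels, then one empty line.
    lines = [" ".join(class_labels), ""]
    # Then one line per attribute.
    for attr in attributes:
        if attr == "continuous":
            lines.append("continuous")
        else:
            lines.append("discrete " + " ".join(attr))
    return "\n".join(lines) + "\n"


def write_name_file(repository, name, class_labels, attributes):
    """Write <repository>/<name>/<name>.name and return its path."""
    data_dir = Path(repository) / name
    data_dir.mkdir(parents=True, exist_ok=True)
    name_file = data_dir / f"{name}.name"
    name_file.write_text(render_name_file(class_labels, attributes))
    return name_file
```

For example, write_name_file("HEMLOCK/data", "weather", ["yes", "no"], ["continuous", ["sunny", "rainy"]]) would create HEMLOCK/data/weather/weather.name, next to which the matching weather.data file would be expected.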
Running Hemlock
Use runHemlock [inputPath] [outputPath] to run Hemlock. Both arguments are
required.
inputPath: path of the experiment file to be run
outputPath: directory for result files to be written to
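When Hemlock is driven from a script, the invocation above could be wrapped as in the following Python sketch. It assumes that a launcher named runHemlock can be found on the PATH, which may not match your setup. The helper names are hypothetical.

```python
import subprocess


def build_command(input_path, output_path, launcher="runHemlock"):
    """Return the argument list for one Hemlock run.

    Both arguments are required, so an empty value is rejected early.
    """
    if not input_path or not output_path:
        raise ValueError("both inputPath and outputPath are required")
    return [launcher, str(input_path), str(output_path)]


def run_experiment(input_path, output_path, launcher="runHemlock"):
    """Run Hemlock and raise CalledProcessError on a non-zero exit status."""
    return subprocess.run(build_command(input_path, output_path, launcher), check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when paths contain spaces.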