
Hazelcast Jet machine learning algorithms

Primary language: Java. License: Apache License 2.0.

Hazelcast JET ML

Machine learning algorithms using the distributed computing platform Hazelcast JET.

See JetMLDemo for an example of how to use the Jet ML Pipeline.

Installation

git clone https://github.com/selvinsource/hazelcast-jet-ml.git
cd hazelcast-jet-ml
mvn clean compile test assembly:single

Documentation

The Jet ML Pipeline lets you chain Estimators and Transformers.

  • The Estimator is an algorithm that returns a Transformer given a dataset to fit
  • The Transformer is an ML model that transforms one dataset into another
  • A dataset is represented by a Hazelcast IListJet, which is not distributed; a future version will switch to a distributed IMapJet
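
The Estimator/Transformer relationship described above can be sketched in plain Java. This is only an illustration under stated assumptions: the two interfaces and the MeanCenterer class below are hypothetical, they work on java.util.List rather than IListJet, and they are not the project's actual types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical interfaces for illustration only; the project's own types operate on IListJet.
interface Transformer<T> {
    List<T> transform(List<T> dataset);
}

interface Estimator<T> {
    Transformer<T> fit(List<T> dataset);
}

// Toy estimator: learns the per-column mean and returns a transformer that subtracts it.
class MeanCenterer implements Estimator<double[]> {
    @Override
    public Transformer<double[]> fit(List<double[]> dataset) {
        int dims = dataset.get(0).length;
        double[] mean = new double[dims];
        for (double[] row : dataset) {
            for (int d = 0; d < dims; d++) {
                mean[d] += row[d] / dataset.size();
            }
        }
        // The returned transformer is the "model": it only carries the learned mean.
        return input -> {
            List<double[]> out = new ArrayList<>();
            for (double[] row : input) {
                double[] centered = new double[dims];
                for (int d = 0; d < dims; d++) {
                    centered[d] = row[d] - mean[d];
                }
                out.add(centered);
            }
            return out;
        };
    }
}

class PipelineSketch {
    public static void main(String[] args) {
        List<double[]> train = Arrays.asList(new double[] {1.0, 10.0}, new double[] {3.0, 30.0});
        List<double[]> test = Collections.singletonList(new double[] {2.0, 25.0});
        // Same chaining shape as the pipeline: fit on one dataset, transform another.
        List<double[]> result = new MeanCenterer().fit(train).transform(test);
        System.out.println(Arrays.toString(result.get(0))); // prints [0.0, 5.0]
    }
}
```

Because fit returns a Transformer, the two calls chain in one expression, which is the shape the pipeline examples further down rely on.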

Inspired by scikit-learn, see paper.

Datasets

The following datasets have been used:

K-Means Clustering Examples

Train a model and show identified clusters

// Create two Jet members
JetInstance instance1 = Jet.newJetInstance();
Jet.newJetInstance();

// Get a training dataset (it is assumed this is already populated, e.g. from a file)
IListJet<double[]> trainDataset = instance1.getList("trainDataset");

// Train a model on the training dataset with k = 3 and maxIter = 20
// k = 3: the number of desired clusters
// maxIter = 20: the maximum number of iterations if the algorithm does not converge earlier
KMeans kMeans = new KMeans(3, 20);
KMeansModel model = kMeans.fit(trainDataset);

// Show the identified centroids
LOGGER.info("Centroids:");
model.getCentroids().stream().forEach(c -> LOGGER.info(Arrays.toString(c)));

Jet.shutdownAll();
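
To make the roles of k and maxIter concrete, here is a minimal single-machine sketch of the standard k-means iteration (Lloyd's algorithm). It is not the project's distributed implementation: the KMeansSketch class is hypothetical, and seeding the centroids with the first k points is an assumption made to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class KMeansSketch {

    // Squared Euclidean distance between two points of equal dimension.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    // Index of the centroid closest to the given point.
    static int nearest(double[] point, List<double[]> centroids) {
        int best = 0;
        for (int c = 1; c < centroids.size(); c++) {
            if (squaredDistance(point, centroids.get(c)) < squaredDistance(point, centroids.get(best))) {
                best = c;
            }
        }
        return best;
    }

    // Lloyd's algorithm. Assumption: centroids are seeded with the first k points.
    static List<double[]> fit(List<double[]> data, int k, int maxIter) {
        List<double[]> centroids = new ArrayList<>();
        for (int c = 0; c < k; c++) {
            centroids.add(data.get(c).clone());
        }
        int dims = data.get(0).length;
        for (int iter = 0; iter < maxIter; iter++) {
            // Assignment step: accumulate each point into its nearest cluster.
            double[][] sums = new double[k][dims];
            int[] counts = new int[k];
            for (double[] point : data) {
                int c = nearest(point, centroids);
                counts[c]++;
                for (int d = 0; d < dims; d++) {
                    sums[c][d] += point[d];
                }
            }
            // Update step: move each centroid to the mean of its assigned points.
            boolean changed = false;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // keep an empty cluster's centroid where it is
                }
                double[] updated = new double[dims];
                for (int d = 0; d < dims; d++) {
                    updated[d] = sums[c][d] / counts[c];
                }
                if (!Arrays.equals(updated, centroids.get(c))) {
                    changed = true;
                }
                centroids.set(c, updated);
            }
            if (!changed) {
                break; // converged before reaching maxIter
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        List<double[]> data = Arrays.asList(
                new double[] {0.0, 0.0}, new double[] {10.0, 10.0},
                new double[] {0.0, 2.0}, new double[] {10.0, 12.0});
        for (double[] centroid : fit(data, 2, 20)) {
            System.out.println(Arrays.toString(centroid)); // [0.0, 1.0] then [10.0, 11.0]
        }
    }
}
```

In this toy run the centroids stop moving after the second pass, so the loop exits well before maxIter; the cap only matters when the assignments keep changing.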

Train a model and predict test data using Jet ML Pipeline

// Create two Jet members
JetInstance instance1 = Jet.newJetInstance();
Jet.newJetInstance();
 
// Get datasets to train the model and then test it
IListJet<double[]> trainDataset = instance1.getList("trainDataset");
IListJet<double[]> testDataset = instance1.getList("testDataset");

// Create a KMeans estimator
Estimator<double[]> estimator = new KMeans(3, 20);

// Hazelcast Jet ML Pipeline: given a training dataset, the estimator (KMeans) returns a transformer (KMeansModel), which assigns clusters to the test dataset instances
IListJet<double[]> outputDataset = estimator.fit(trainDataset).transform(testDataset);

Jet.shutdownAll();
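
The transform step of a fitted k-means model amounts to a nearest-centroid lookup for each test instance. The sketch below illustrates that idea in plain Java; CentroidModelSketch is a hypothetical stand-in rather than the project's KMeansModel, and the output layout (the input row with the cluster index appended as the last element) is an assumption, since the layout of outputDataset is not documented here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a fitted k-means model; not the project's KMeansModel.
class CentroidModelSketch {
    private final List<double[]> centroids;

    CentroidModelSketch(List<double[]> centroids) {
        this.centroids = centroids;
    }

    // Index of the centroid with the smallest squared Euclidean distance to the point.
    int predict(double[] point) {
        int best = 0;
        double bestDistance = Double.MAX_VALUE;
        for (int c = 0; c < centroids.size(); c++) {
            double distance = 0.0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centroids.get(c)[d];
                distance += diff * diff;
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                best = c;
            }
        }
        return best;
    }

    // Assumption: each output row is the input row with the cluster index appended.
    List<double[]> transform(List<double[]> dataset) {
        List<double[]> out = new ArrayList<>();
        for (double[] row : dataset) {
            double[] labelled = Arrays.copyOf(row, row.length + 1);
            labelled[row.length] = predict(row);
            out.add(labelled);
        }
        return out;
    }

    public static void main(String[] args) {
        CentroidModelSketch model = new CentroidModelSketch(Arrays.asList(
                new double[] {0.0, 1.0}, new double[] {10.0, 11.0}));
        List<double[]> test = Arrays.asList(new double[] {1.0, 1.0}, new double[] {9.0, 12.0});
        for (double[] row : model.transform(test)) {
            System.out.println(Arrays.toString(row)); // [1.0, 1.0, 0.0] then [9.0, 12.0, 1.0]
        }
    }
}
```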

K-Means Clustering Demo

java -jar target/hazelcast-jet-ml-0.6.1-jar-with-dependencies.jar KMeans

See the full demo code.