Intro

This is an example Spark application that demonstrates using a Support Vector Machine (SVM) to classify data. The maximum-margin hyperplane is found by stochastic subgradient descent (SGD) using the Pegasos algorithm (see references below).

For real workloads you should most likely use the SVM support in the Apache Spark ML library. The purpose of this project is to show, in a small self-contained project, how SGD can be used to learn a maximum-margin hyperplane.
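As a rough illustration of the technique, the mini-batch Pegasos update can be sketched in plain Scala without Spark. This is a minimal sketch under stated assumptions, not this project's actual code: the names PegasosSketch, Point, train and predict are hypothetical, the optional projection step of Pegasos is omitted, and the caller appends a constant 1.0 feature to act as a bias term.

```scala
import scala.util.Random

// Hypothetical, Spark-free sketch of mini-batch Pegasos (not the project's code).
object PegasosSketch {
  // A labelled example: label is +1.0 or -1.0, features is a dense vector.
  final case class Point(label: Double, features: Array[Double])

  private def dot(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0
    var i = 0
    while (i < a.length) { s += a(i) * b(i); i += 1 }
    s
  }

  // T iterations, k samples per iteration, regularization parameter lambda.
  def train(data: IndexedSeq[Point], T: Int, k: Int, lambda: Double, seed: Long): Array[Double] = {
    require(data.nonEmpty && T > 0 && k > 0 && k <= data.size && lambda > 0.0)
    val rng = new Random(seed)
    val dim = data.head.features.length
    var w = Array.fill(dim)(0.0)
    for (t <- 1 to T) {
      // Draw k distinct examples uniformly at random.
      val batch = rng.shuffle(data.indices.toList).take(k).map(i => data(i))
      // Keep only the examples that violate the margin under the current w.
      val violators = batch.filter(p => p.label * dot(w, p.features) < 1.0)
      val eta = 1.0 / (lambda * t) // step size at iteration t
      // Shrink w (regularization), then step along the hinge-loss subgradient.
      val next = w.map(_ * (1.0 - eta * lambda))
      for (p <- violators; j <- 0 until dim)
        next(j) += (eta / k) * p.label * p.features(j)
      w = next
    }
    w
  }

  // Classify by the sign of the decision value.
  def predict(w: Array[Double], x: Array[Double]): Int =
    if (dot(w, x) >= 0.0) 1 else -1

  def main(args: Array[String]): Unit = {
    // The two rows of the trivial example, each with a trailing 1.0 as bias feature.
    val data = Vector(
      Point(-1.0, Array(0.0, 0.0, 1.0)),
      Point(1.0, Array(10.0, 10.0, 1.0)))
    val w = train(data, T = 100, k = 2, lambda = 0.01, seed = 42L)
    data.foreach(p => println(s"label=${p.label} predicted=${predict(w, p.features)}"))
  }
}
```

The parameters T, k, lambda and seed mirror the command-line options of the same names. Sampling in this sketch is without replacement, so when k equals the data size every iteration sees all points.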

Data Format

The input must be a delimited text file (e.g. comma, tab, or space delimited). The delimiter can be set when you run the program. The first column/field must be the class label, and its value must be either 1 or -1. A trivial example is shown below.

-1, 0, 0
1, 10, 10
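To make the format concrete, here is a minimal sketch of how one such line could be parsed and checked. It is an illustration only, not the parser this project uses; InputFormatSketch and parseLine are hypothetical names, and treating every field after the label as a numeric feature is an assumption.

```scala
import java.util.regex.Pattern
import scala.util.Try

// Hypothetical sketch of parsing one input line (not the project's parser).
object InputFormatSketch {
  // Returns Some((label, features)) for a well-formed line, None otherwise.
  def parseLine(line: String, delim: String): Option[(Int, Array[Double])] = {
    // Quote the delimiter so characters such as "|" are not treated as regex syntax.
    val fields = line.split(Pattern.quote(delim)).map(_.trim)
    if (fields.length < 2) None // need a label and at least one feature
    else for {
      // The first field is the class label and must be 1 or -1.
      label <- Try(fields.head.toInt).toOption if label == 1 || label == -1
      // Assumption: all remaining fields are numeric feature values.
      features <- Try(fields.tail.map(_.toDouble)).toOption
    } yield (label, features)
  }

  def main(args: Array[String]): Unit = {
    Seq("-1, 0, 0", "1, 10, 10", "2, 10, 10").foreach { line =>
      val parsed = parseLine(line, ",").map { case (y, x) => (y, x.toList) }
      println(s"$line => $parsed")
    }
  }
}
```

Lines whose label is not 1 or -1, or whose features are not numeric, yield None in this sketch, so a caller can decide whether to skip or reject them.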

HOWTO use

To use this library, use spark-submit as follows.

/path/to/spark-submit \
 --class com.github.vangj.spark.svm.Sgd \
 --master <master-url> \
 --deploy-mode <deploy-mode> \
 /path/to/spark-svm-assembly-0.0.1-SNAPSHOT.jar \
 --T <number-of-iterations> \
 --k <number-of-samples> \
 --lambda <regularization-parameter> \
 --seed <seed-for-randomization> \
 --delim <delimiter-for-input-file> \
 --input <path-of-input-file> \
 --output <path-of-output-file>

Notes on parameters.

T is the number of iterations. Specify something like 400. Classification results may be poor if T is too small.
k is the number of samples drawn at each iteration. Specify a value less than or equal to the number of rows in your input.
lambda is the regularization parameter. In Pegasos it also sets the step size (learning rate), which is 1/(lambda * t) at iteration t. Specify a value greater than 0 and at most 1.
seed is the seed for the random number generator used when sampling.
delim is the delimiter used to parse your input file.
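The constraints described in these notes can be written down as a small check. The following is a hedged sketch rather than anything this project ships: ParamCheckSketch, Params and validate are hypothetical names, and requiring lambda to be strictly positive is an assumption based on the Pegasos step size 1/(lambda * t).

```scala
// Hypothetical sketch of sanity-checking the command-line parameters.
object ParamCheckSketch {
  final case class Params(T: Int, k: Int, lambda: Double, seed: Long, delim: String)

  // Returns a list of problems; an empty list means the parameters look usable.
  def validate(p: Params, sampleSize: Long): List[String] = {
    val checks = List(
      (p.T <= 0) -> "T must be a positive number of iterations",
      (p.k <= 0 || p.k > sampleSize) -> "k must be between 1 and the sample size",
      // Assumption: lambda = 0 is rejected because the step size 1/(lambda * t) is undefined.
      (p.lambda <= 0.0 || p.lambda > 1.0) -> "lambda must be greater than 0 and at most 1",
      p.delim.isEmpty -> "delim must not be empty"
    )
    // Keep the message of every check whose failure condition is true.
    checks.collect { case (true, msg) => msg }
  }

  def main(args: Array[String]): Unit = {
    val params = Params(T = 400, k = 5, lambda = 0.01, seed = 42L, delim = ",")
    // With only 2 rows of input, k = 5 is reported as a problem.
    validate(params, sampleSize = 2L).foreach(println)
  }
}
```

Collecting all problems instead of failing on the first one lets a caller report every bad option in a single run.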

Building

This project depends on the following.

Java v1.8
Scala v2.11.8

You may use the following tools to build the project.

SBT v0.13.13
Maven v3.3.9

For SBT, type in the following.

sbt assembly

For Maven, type in the following.

mvn package

References

Copyright Stuff

Copyright 2017 Jee Vang

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
