This is an example Spark application that demonstrates using a Support Vector Machine (SVM) to classify data. The maximum-margin hyper-plane is found by stochastic subgradient descent (SGD) using the Pegasos algorithm (see references below).
For real workloads, you should most likely use the SVM implementation in the Apache Spark ML library. The purpose of this project is simply to show, in a self-contained file/project, how one may use SGD to learn a maximum-margin hyper-plane.
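As a rough illustration of what "maximum-margin" means here, the sketch below evaluates the regularized hinge-loss objective that Pegasos minimizes. It is plain Python rather than this project's Spark code, and the names `dot`, `hinge_loss`, `objective`, and `lam` are hypothetical. It also assumes a hyper-plane through the origin (no bias term).

```python
def dot(w, x):
    """Inner product of two equal-length vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))


def hinge_loss(w, x, y):
    """Hinge loss of one example; y is the class label, 1 or -1."""
    return max(0.0, 1.0 - y * dot(w, x))


def objective(w, data, lam):
    """Regularized objective: lam/2 * ||w||^2 + mean hinge loss.

    data is a list of (label, features) pairs (an assumed layout).
    """
    reg = 0.5 * lam * dot(w, w)
    avg_loss = sum(hinge_loss(w, x, y) for y, x in data) / len(data)
    return reg + avg_loss
```

A weight vector that separates every point with a margin of at least 1 pays only the regularization term, so among such vectors the objective prefers the one with the smallest norm, which corresponds to the widest margin.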
You must use a delimited file (e.g. comma, tab, or space delimited) as input. The delimiter can be set when you run the program. The first column/field must be the class label, which can only be 1 or -1; the remaining fields are the numeric features. A trivial example is shown below.
-1, 0, 0
1, 10, 10
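The input format above can be parsed along these lines. This is a minimal sketch in plain Python, not the project's actual parser; `parse_line` and its `delim` argument are hypothetical names chosen to mirror the `--delim` option.

```python
def parse_line(line, delim=","):
    """Split one delimited line into (label, features).

    The first field is the class label (1 or -1); the rest are features.
    """
    fields = [f.strip() for f in line.split(delim)]
    label = float(fields[0])
    if label not in (1.0, -1.0):
        raise ValueError(f"label must be 1 or -1, got {fields[0]!r}")
    features = [float(f) for f in fields[1:]]
    return int(label), features


# Usage on the two example rows:
rows = [parse_line(s) for s in ["-1, 0, 0", "1, 10, 10"]]
```

Whitespace around each field is stripped, so `1, 10, 10` and `1,10,10` parse the same way. Runs of repeated space delimiters are not handled by this sketch.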
To run the application, use spark-submit as follows.
/path/to/spark-submit \
--class com.github.vangj.spark.svm.Sgd \
--master <master-url> \
--deploy-mode <deploy-mode> \
/path/to/spark-svm-assembly-0.0.1-SNAPSHOT.jar \
--T <number-of-iterations> \
--k <number-of-samples> \
--lambda <regularization-parameter> \
--seed <seed-for-randomization> \
--delim <delimiter-for-input-file> \
--input <path-of-input-file> \
--output <path-of-output-file>
Notes on parameters.
- T is the number of iterations. Specify something like 400. You may get unacceptable classification results if T is too small.
- k is the number of samples taken at each iteration. Specify something less than or equal to the number of records in your input file.
- lambda is the regularization parameter. In Pegasos it also determines the learning rate, which at iteration t is 1 / (lambda * t). Specify a value greater than 0 and at most 1.
- seed is used for the random number generator when sampling.
- delim is used to parse your input file.
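To show how T, k, lambda, and seed interact, here is a minimal single-machine sketch of mini-batch Pegasos as described in the Pegasos paper. It is not this project's Spark implementation. The function name `pegasos` and the `(label, features)` data layout are assumptions, the hyper-plane passes through the origin (no bias term), and `lam` must be positive because it appears in a denominator.

```python
import math
import random


def pegasos(data, T, k, lam, seed):
    """Mini-batch Pegasos over a list of (label, features) pairs."""
    rng = random.Random(seed)            # seed makes the sampling repeatable
    w = [0.0] * len(data[0][1])
    for t in range(1, T + 1):
        batch = rng.sample(data, k)      # k distinct records, so k <= len(data)
        eta = 1.0 / (lam * t)            # step size shrinks as t grows
        # Records in the batch that violate the margin under the current w.
        violators = [
            (y, x) for y, x in batch
            if y * sum(wi * xi for wi, xi in zip(w, x)) < 1.0
        ]
        # Subgradient step: shrink w, then move toward the violators.
        w = [(1.0 - eta * lam) * wi for wi in w]
        for y, x in violators:
            for j, xj in enumerate(x):
                w[j] += (eta / k) * y * xj
        # Optional projection onto the ball of radius 1 / sqrt(lam).
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > 0.0:
            scale = min(1.0, 1.0 / (math.sqrt(lam) * norm))
            w = [scale * wi for wi in w]
    return w


def predict(w, x):
    """Classify x by the side of the hyper-plane it falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0.0 else -1
```

Because there is no bias term, data that cannot be separated by a hyper-plane through the origin would need an extra constant feature. That is a workaround for this sketch only, not a statement about how the project handles it.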
This project depends on the following.
- Java v1.8
- Scala v2.11.8
You may use the following tools to build the project.
- SBT v0.13.13
- Maven v3.3.9
For SBT, type in the following.
sbt assembly
For Maven, type in the following.
mvn package
- Pegasos: Primal Estimated sub-GrAdient SOlver for SVM
- The Stochastic Gradient Descent for the Primal L1-SVM Optimization Revisited
- Large-Scale Support Vector Machines: Algorithms and Theory
- scikit-learn
- Support vector machine
- Subgradient method
- Spark ML Linear Methods
Copyright 2017 Jee Vang
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.