rapaio

Statistics, data mining and machine learning toolbox in Java.

Rapaio Manual is published on gitbooks.com. Using the previous link you can select the format of manual. To read it directly online you can use this direct link: Rapaio Manual - read on-line.

NEW: The manual contains a Tutorial on Kaggle's Titanic Competition!

Acknowledgements

Many thanks to ej-technologies GmbH for providing an open source license for their Java Profiler and to JetBrains for providing an open source license for their Java IDE.

Stable Features

For each feature there are some notes regarding the development stage. If there are no notes it means the feature is considered to be fully implemented and well tested.

Core

Special Math functions
Maximum, Minimum, Mode (only for nominal values), Sum, Mean, Variance, Quantiles
Online Statistics: minimum, maximum, count, mean, variance, standard deviation, skewness, kurtosis

Core tools

DVector
DTable
Distance Matrix

Correlations

Pearson product-moment coefficient
Spearman's rank correlation coefficient

Distributions

Bernoulli
Binomial
Poisson
Normal/Gaussian
Student t
ChiSquare
Discrete Uniform
Continuous Uniform
Hypergeometric
Gamma
Empirical KDE (gaussian, epanechnikov, cosine, tricube, biweight, triweight, triangular, uniform)

Sampling

SamplingTool
- generates discrete integer samples with/without replacement, weighted/non-weighted
- offers utility methods for bootstraps, simple random, stratified sampling
Samplers used in machine learning algorithms

Hypothesis Testing

z test
- one sample test for testing the sample mean
- two unpaired samples test for testing difference of the sample means
- two paired samples test for testing sample mean of the differences
t test
- one sample test for testing the sample mean
- two unpaired samples t test with same variance
- two unpaired samples Welch t test with different variances
- two paired samples test for testing sample mean of differences
Kolmogorov Smirnoff KS test
- one sample test for testing if a sample belongs to a distribution
- two samples test for testing if both samples comes from the same distribution
Pearson Chi-Square tests
- goodness of fit
- independence test
- conditional independence test
Anderson-Darling goodness of fit
- normality test

Frame Filters

FFJitter - add jitter to data according with a noise distribution
FFAddIntercept - add an intercept variable to a given data set
FFMapVars - select some variables according with a VRange pattern
FFRemoveVars - removes some variables according with a VRange pattern
FFStandardize - standardize variables from a given data frame
FFRandomProjection - project a data frame onto random projections

Var filters

VFCumulativeSum - builds a numeric vector with a cumulative sum
VFJitter - adds noise to a given numeric vector according with a noise distribution
VFRefSort - sorts a variable according with a given set of row comparators
VFShuffle - shuffles values from a variable
VFSort - sorts a variable according with default comparator
VFStandardize - standardize values from a given numeric variable
VFToIndex - transforms a variable into an index type using a lambda
VFToNumeric - transforms a variable into numeric using a lambda
VFTransformPower - transform a variable withg power transform
VFUpdate - updates a variable using a lambda on VSpot
VFUpdateIndex - updates a variable using a lambda on index value
VFUpdateLabel - updates a variable using a lambda on label value
VFUpdateValue - updates a variable using a lambda on double value

Evaluation

Confusion Matrix
Receiver Operator Characteristic - ROC curves and ROC Area
Root Mean Square Error
Mean Absolute Error
Gini / Normalized Gini

Analysis

Fast Fourier Transform
Principal Components Analysis
Fischer Linear Discriminant Analysis

Classification

Bayesian: NaiveBayes (GaussianPdf, EmpiricalPdf, MultinomialPmf)
Linear: BinaryLogistic
Rule: OneRule
Decision Trees - CTree: DecisionStump, ID3, C45, CART
- purity: entropy, infogain, gain ration, gini index
- weight on instances
- split: numeric binary, nominal binary, nominal full
- missing value handling: ignore, random, majority, weighted
- reduced-error pruning
- variable importance: frequency, gain and permutation based
Ensemble: CForest - Bagging, Random Forests
Boosting: AdaBoost.SAMME
SVM: BinarySMO (Platt)

Regression

Simple: ConstantRegression
Simple: L1Regression
Simple: L2Regression
Simple: RandomValueRegressor
Tree: CART (no pruning)
Tree: C45 (no pruning)
Tree: DecisionStump
LinearRegression (multiple targets, only numerical attributes)

Clustering

Cluster Silhouette
KMeans clustering
Minkowski Weighted KMeans

Time Series

Acf (correlation, covariance)
Pacf

Graphics

QQ Plot
Box Plot
Histogram
2d Histogram
Plot function line
Plot vertical/horizontal/ab line
Plot lines
Plot points
Density line KDE
ROC Curve
Discrete Vertical Lines
Segment2D

Matrices and vectors

Numeric vector operations
Basic matrix operations and matrix decompositions

Experminental Stage Features