Naive Bayes classifier. Currently 3 types of NB are supported:

- `MultinomialNB` - Assumes variables have a multinomial distribution. Good for text classification. See `examples/nums.jl` for usage.
- `GaussianNB` - Assumes variables have a multivariate normal distribution. Good for real-valued data. See `examples/iris.jl` for usage, and the sketch after this list.
- `HybridNB` - A hybrid empirical naive Bayes model for a mixture of continuous and discrete features. The continuous features are estimated using Kernel Density Estimation.
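For orientation, a minimal `GaussianNB` train/predict round trip might look like the sketch below. It assumes the `fit`/`predict` pattern from `examples/iris.jl`, with observations stored as matrix columns; treat it as a sketch, not the canonical example.

```julia
using NaiveBayes

# 4 features x 150 observations; storing observations as columns is an
# assumption based on examples/iris.jl.
X = randn(4, 150)
y = rand(["setosa", "versicolor", "virginica"], 150)  # one class label per observation

model = GaussianNB(unique(y), 4)  # class labels and number of features
fit(model, X, y)

y_pred = predict(model, X)
accuracy = count(y_pred .== y) / length(y)  # training accuracy
```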
Note: `fit`/`predict` methods take `Dict{Symbol/AbstractString, Vector}` rather than a `Matrix`. Also, discrete features must be integers while continuous features must be floats. If all features are continuous, `Matrix` input is supported.
Since `GaussianNB` models a multivariate distribution, it's not really a "naive" classifier (i.e. no independence assumption is made), so the name may change in the future.
As a byproduct, this package also provides a `DataStats` type that may be used for incremental calculation of common data statistics such as the mean and covariance matrix. See `test/datastatstest.jl` for a usage example.
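For illustration, incremental use might look like the sketch below; the constructor and update-function signatures are assumptions on my part, so defer to `test/datastatstest.jl` for the authoritative API.

```julia
using NaiveBayes

# Accumulator for 5 variables; the constructor and the observation-axis
# convention are assumptions -- see test/datastatstest.jl for actual usage.
dstats = DataStats(5)

# Feed the data in incrementally, batch by batch
# (here: three batches of 100 observations x 5 variables).
for chunk in (randn(100, 5) for _ in 1:3)
    updatestats(dstats, chunk)  # assumed incremental-update entry point
end

mean(dstats)  # running mean of all data seen so far
cov(dstats)   # running covariance matrix
```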
### Examples:
- Continuous and discrete features as `Dict{Symbol, Vector}`

```julia
f_c1 = randn(10)
f_c2 = randn(10)
f_d1 = rand(1:5, 10)
f_d2 = rand(3:7, 10)

training_features_continuous = Dict{Symbol, Vector{Float64}}(:c1=>f_c1, :c2=>f_c2)
training_features_discrete = Dict{Symbol, Vector{Int}}(:d1=>f_d1, :d2=>f_d2) # discrete features as Int64

hybrid_model = HybridNB(labels)

# train the model
fit(hybrid_model, training_features_continuous, training_features_discrete, labels)

# predict the classification for new events (points): features_c, features_d
y = predict(hybrid_model, features_c, features_d)
```
Alternatively one can skip declaring the model and train it directly:

```julia
model = train(HybridNB, training_features_continuous, training_features_discrete, labels)
y = predict(model, features_c, features_d)
```
- Continuous features only as a `Matrix`

```julia
X_train = randn(3, 400)
X_classify = randn(3, 10)

hybrid_model = HybridNB(labels) # the number of discrete features is 0 so it's not needed
fit(hybrid_model, X_train, labels)
y = predict(hybrid_model, X_classify)
```
- Continuous and discrete features as a `Matrix{Float64}`

```julia
# X is a matrix of features
# the first 3 rows are continuous
training_features_continuous = restructure_matrix(X[1:3, :])

# the last 2 rows are discrete and must be converted to integers
training_features_discrete = Dict(k => map(Int, v) for (k, v) in restructure_matrix(X[4:5, :]))

# train the model
hybrid_model = train(HybridNB, training_features_continuous, training_features_discrete, labels)

# predict the classification for new events (points): features_c, features_d
y = predict(hybrid_model, features_c, features_d)
```
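For intuition, `restructure_matrix` converts an n_features × n_samples matrix into the per-feature `Dict` form used above, one named vector per row. A rough, hypothetical equivalent (the `:x1`, `:x2`, ... key naming is an assumption) is:

```julia
# Rough sketch of what restructure_matrix produces -- the key names are an assumption.
rows_to_features(M::Matrix) =
    Dict{Symbol, Vector}(Symbol("x$i") => vec(M[i, :]) for i in 1:size(M, 1))
```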
It is useful to train a model once and then use it for prediction many times later. For example, train your classifier on a local machine and then use it on a cluster to classify points in parallel.
There is support for writing `HybridNB` models to HDF5 files via the methods `write_model` and `load_model`. This is useful for interacting with other programs/languages. If the model file is going to be read only in Julia, it is easier to use JLD.jl for saving and loading the file.
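A minimal save/load round trip might look like the following; the exact argument order of `write_model` and `load_model` is an assumption here, so check the method docstrings.

```julia
# Persist a trained model to HDF5 (signatures assumed; verify against the docstrings).
write_model(hybrid_model, "hybrid_model.h5")

# ...later, possibly in another process or on a cluster node:
loaded_model = load_model("hybrid_model.h5")
y = predict(loaded_model, features_c, features_d)
```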