OutlierDetection.jl is a Julia toolkit for detecting outlying objects, also known as *anomalies*. This package is an effort to make Julia a first-class citizen in the outlier and anomaly detection community. Why should you use this package?
- Provides a unified API for outlier detection in Julia
- Provides access to state-of-the-art outlier detection algorithms
- Seamlessly integrates with Julia's existing machine learning ecosystem
It is recommended to use Pkg.jl for installation. Run the command below to install the latest official release, or use `] add OutlierDetection` in the Julia REPL.

```julia
import Pkg
Pkg.add("OutlierDetection")
```

If you would like to modify the package locally, you can use `Pkg.develop("OutlierDetection")` or `] dev OutlierDetection` in the Julia REPL. This fetches a full clone of the package to `~/.julia/dev/` (the path can be changed by setting the environment variable `JULIA_PKG_DEVDIR`).
You typically want to interface with OutlierDetection.jl through the MLJ API. However, it's also possible to use OutlierDetection.jl without MLJ. The main parts of the raw API are the functions `fit`, `score`, and `detect`. Note that the raw API uses the columns-as-observations convention for improved performance, so you may have to transpose your input data.
```julia
using OutlierDetection
using OutlierDetectionData: ODDS

# create a detector (a collection of hyperparameters)
lof = LOF()

# download and open the thyroid benchmark dataset
X, y = ODDS.load("thyroid")

# use 50% of the data for training
n_train = length(y) ÷ 2
train, test = eachindex(y)[1:n_train], eachindex(y)[n_train+1:end]

# learn a model from the training data
model = fit(lof, X[train, :])

# predict outlier scores with the learned model
train_scores, test_scores = score(lof, model, X[test, :])

# transform the train/test scores to binary labels
ŷ = detect(Class(), train_scores, test_scores)
```
The main difference between the raw API and MLJ is, besides method naming differences, the introduction of a *machine*. In the raw API, we explicitly pass the results of fitting a detector (models) to further `score` calls. Machines allow us to hide that complexity by binding data directly to detectors and automatically passing fit results to further `transform` (unsupervised) or `predict` (supervised) calls. Under the hood, `transform` and `predict` pass the input data and the previous fit result to `score`.
```julia
using MLJ # or: using MLJBase
using OutlierDetection
using OutlierDetectionData: ODDS

# download and open the thyroid benchmark dataset
X, y = ODDS.load("thyroid")

# use 50% of the data for training
n_train = length(y) ÷ 2
train, test = eachindex(y)[1:n_train], eachindex(y)[n_train+1:end]

# create a pipeline consisting of a detector and a classifier
pipe = @pipeline LOF() Class()

# create a machine by binding the pipeline to data
mach = machine(pipe, X)

# learn from the training data
fit!(mach, rows=train)

# predict labels with the learned machine
ŷ = transform(mach, rows=test)
```
Algorithms marked with '✓' are implemented in Julia. Algorithms marked with '✓ (py)' are implemented in Python (thanks to the wonderful PyOD library) with an existing Julia interface through PyCall. If you would like to know more, open the detector reference. *Note:* If you would like to use the Python variant of an algorithm, prepend the algorithm name with `Py`, e.g., `PyLOF` is the Python variant of `LOF`.
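For example, switching between the two implementations is just a matter of the constructor name (the hyperparameter values below are illustrative; note that the Python variants use PyOD's keyword names, as in the benchmark example further down):

```julia
using OutlierDetection

# Julia implementation of the local outlier factor
lof = LOF(k=5)

# Python (PyOD) implementation of the same algorithm,
# accessed by prepending `Py` to the algorithm name
pylof = PyLOF(n_neighbors=5)
```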
Name | Description | Year | Status | Authors |
---|---|---|---|---|
LMDD | Linear deviation-based outlier detection | 1996 | ✓ (py) | Arning et al. |
KNN | Distance-based outliers | 1997 | ✓ | Knorr and Ng |
MCD | Minimum covariance determinant | 1999 | ✓ (py) | Rousseeuw and Driessen |
KNN | Distance to the k-th nearest neighbor | 2000 | ✓ | Ramaswamy |
LOF | Local outlier factor | 2000 | ✓ | Breunig et al. |
OCSVM | One-Class support vector machine | 2001 | ✓ (py) | Schölkopf et al. |
KNN | Sum of distances to the k-nearest neighbors | 2002 | ✓ | Angiulli and Pizzuti |
COF | Connectivity-based outlier factor | 2002 | ✓ | Tang et al. |
LOCI | Local correlation integral | 2003 | ✓ (py) | Papadimitriou et al. |
CBLOF | Cluster-based local outliers | 2003 | ✓ (py) | He et al. |
PCA | Principal component analysis | 2003 | ✓ (py) | Shyu et al. |
IForest | Isolation forest | 2008 | ✓ (py) | Liu et al. |
ABOD | Angle-based outlier detection | 2009 | ✓ | Kriegel et al. |
SOD | Subspace outlier detection | 2009 | ✓ (py) | Kriegel et al. |
HBOS | Histogram-based outlier score | 2012 | ✓ (py) | Goldstein and Dengel |
SOS | Stochastic outlier selection | 2012 | ✓ (py) | Janssens et al. |
AE | Auto-encoder reconstruction loss outliers | 2015 | ✓ | Aggarwal |
ABOD | Stable angle-based outlier detection | 2015 | ✓ | Li et al. |
LODA | Lightweight on-line detector of anomalies | 2016 | ✓ (py) | Pevný |
DeepSAD | Deep semi-supervised anomaly detection | 2019 | ✓ | Ruff et al. |
COPOD | Copula-based outlier detection | 2020 | ✓ (py) | Li et al. |
ROD | Rotation-based outlier detection | 2020 | ✓ (py) | Almardeny et al. |
ESAD | End-to-end semi-supervised anomaly detection | 2020 | ✓ | Huang et al. |
If there are already so many algorithms available in Python, why Julia, you might ask? Let's have some fun!
```julia
using OutlierDetection, MLJ
using BenchmarkTools: @benchmark

# generate 100,000 random samples with 10 features each
X = rand(100000, 10);

# fit the Julia and Python implementations of LOF with equal hyperparameters
lof = machine(LOF(k=5, algorithm=:balltree, leafsize=30, parallel=true), X) |> fit!
pylof = machine(PyLOF(n_neighbors=5, algorithm="ball_tree", leaf_size=30, n_jobs=-1), X) |> fit!
```
Julia enables you to implement your favorite algorithm in no time, and it will be fast, blazingly fast.
```julia
@benchmark transform(lof, X)
# > median time: 807.962 ms (0.00% GC)
```
Interoperating with Python is easy!
```julia
@benchmark transform(pylof, X)
# > median time: 31.077 s (0.00% GC)
```
OutlierDetection.jl is a community effort and your help is extremely welcome! See our contribution guide for more information on how to contribute to the project.
We are excited to make Julia a first-class citizen in the outlier detection community and happily accept algorithm contributions to OutlierDetection.jl.
We consider well-established algorithms for inclusion. A rule of thumb is: at least two years since publication, 100+ citations, and demonstrated usefulness and wide use. Algorithms that do not meet the inclusion criteria can simply extend our API; such external algorithms can also be listed in our documentation if the authors wish so.
Additionally, algorithms that implement functionality that is useful on its own should live in their own package and be wrapped by OutlierDetection.jl. Algorithms that build primarily on top of existing packages can be implemented directly in OutlierDetection.jl.
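As a rough, hypothetical sketch of what extending the API could look like (the detector name, fields, and method bodies below are invented for illustration; consult the package's interface documentation for the actual requirements), an external algorithm would define its own detector type and add methods to the `fit` and `score` generics used in the raw-API example above:

```julia
import OutlierDetection: fit, score

# A toy detector that flags points far from the training mean
# (hypothetical; for illustration only)
struct ZScoreDetector
    # no hyperparameters needed for this toy example
end

# fitting stores the statistics of the training data
function fit(detector::ZScoreDetector, X::AbstractVector)
    n = length(X)
    mu = sum(X) / n
    sigma = sqrt(sum((x - mu)^2 for x in X) / (n - 1))
    (mu = mu, sigma = sigma)
end

# scoring assigns each observation its absolute z-score
function score(detector::ZScoreDetector, model, X::AbstractVector)
    abs.((X .- model.mu) ./ model.sigma)
end
```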
Thanks go to these wonderful people (emoji key):

- David Muhr 💻
- Páll Haraldsson 📖
- Anthony Blaom, PhD 💻
This project follows the all-contributors specification. Contributions of any kind welcome!