SCoL and SImbCoL

The Simulated Complexity Library (SCoL) and its counterpart for binary imbalanced datasets, SImbCoL, are packages that simulate a set of complexity measures for classification problems. Both use models generated by a meta-learning approach, which lowers the asymptotic computational cost compared with computing the original measures directly. SCoL estimates the values of the complexity measures available in ECoL [4], and SImbCoL estimates the values of the complexity measures available in ImbCoL [5]. The simulated values are produced by models induced from simple, efficient meta-features implemented by the mfe package [2]. At a low computational cost, the simulated complexity measures quantify the linearity of the data, the presence of informative features, and the sparsity and dimensionality of the datasets.

Measures

The measures in SImbCoL are per-class decompositions of the measures originally proposed by Ho and Basu [3] and extended by many later works, including the ECoL library [4]. The implementation of the decomposed measures can be found in ImbCoL [5]. The measures fall into six groups: feature overlapping measures, neighborhood measures, linearity measures, dimensionality measures, class balance measures and network measures. These measures are simulated by models generated with the Random Forest and Support Vector Machine algorithms.
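The simulation idea described above can be illustrated with a small, self-contained sketch. It is not the SImbCoL implementation: the helper names (make_dataset, loo_1nn_error, cheap_meta_features) are hypothetical, the meta-features are simple stand-ins rather than those provided by mfe, and a plain linear regression replaces the Random Forest and Support Vector Machine models. The sketch only shows the principle of learning to predict an expensive measure (here a leave-one-out 1-NN error, the idea behind N3) from meta-features that are cheap to compute.

```r
# Sketch of the meta-learning idea: predict an expensive complexity measure
# from cheap meta-features. All function names are hypothetical and are not
# part of SImbCoL, SCoL, ECoL, ImbCoL or mfe.
set.seed(42)

# Generate a two-class, two-feature dataset; `sep` shifts the second class
# and therefore controls how much the classes overlap.
make_dataset <- function(n, sep) {
  y <- rep(c(0, 1), each = n / 2)
  x <- matrix(rnorm(2 * n), ncol = 2) + sep * y
  list(x = x, y = y)
}

# Expensive target: leave-one-out 1-NN error rate, which needs the full
# pairwise distance matrix (quadratic in the number of examples).
loo_1nn_error <- function(d) {
  dist_mat <- as.matrix(dist(d$x))
  diag(dist_mat) <- Inf                 # an example is not its own neighbor
  nearest <- apply(dist_mat, 1, which.min)
  mean(d$y[nearest] != d$y)
}

# Cheap meta-features: simple statistics, linear in the number of examples.
cheap_meta_features <- function(d) {
  m0 <- colMeans(d$x[d$y == 0, , drop = FALSE])
  m1 <- colMeans(d$x[d$y == 1, , drop = FALSE])
  c(mean_gap = sqrt(sum((m1 - m0)^2)),
    pooled_sd = mean(apply(d$x, 2, sd)))
}

# Build the meta-dataset: one row per generated dataset.
seps <- runif(60, 0, 4)
datasets <- lapply(seps, function(s) make_dataset(100, s))
meta <- as.data.frame(t(sapply(datasets, cheap_meta_features)))
meta$n3 <- sapply(datasets, loo_1nn_error)

# Fit the meta-model on 45 datasets and evaluate it on the remaining 15.
train <- meta[1:45, ]
holdout <- meta[46:60, ]
model <- lm(n3 ~ mean_gap + pooled_sd, data = train)  # stand-in for RF / SVM
predicted <- predict(model, newdata = holdout)

# Mean absolute error between simulated and directly computed values.
mae <- mean(abs(predicted - holdout$n3))
print(mae)
```

Once such a meta-model is trained, estimating the measure for a new dataset only requires computing the cheap meta-features and calling predict, which avoids the quadratic distance computation.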

Installation

The installation process using devtools is:

if (!require("devtools")) {
    install.packages("devtools")
}
devtools::install_github("victorhb/SImbCoL")
library("SImbCoL")

Example of use

The simplest way to compute the simulated complexity measures is to use the simulated method. The method can be called with a symbolic description of the model (a formula) or with a data frame. The parameters are the dataset and the measures to be extracted. By default, all measures are extracted. A simple example is given next:

# Getting a dataset
library("mlbench")
data(PimaIndiansDiabetes)

## Extract all complexity measures 
SImbCoL::simulated(diabetes ~ ., PimaIndiansDiabetes)

## Extract all complexity measures using a data frame
SImbCoL::simulated(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9])

## Extract the N3 measure for the positive class
SImbCoL::simulated(diabetes ~ ., PimaIndiansDiabetes, features = c("N3.P"))

Developer notes

The implementation of SImbCoL is based on the implementation of SCoL. When both packages are loaded, we suggest prefixing calls with the namespace SImbCoL:: to avoid name conflicts.

To report bugs or request features, open an issue on the project issues page.

To cite SImbCoL in publications use:

Barella, V. H., Garcia, L. P., and de Carvalho, A. C. (2020, October). Simulating Complexity Measures on Imbalanced Datasets. In Brazilian Conference on Intelligent Systems (pp. 498-512). Springer, Cham.

References

[1] Barella, V. H., Garcia, L. P., and de Carvalho, A. C. (2020, October). Simulating Complexity Measures on Imbalanced Datasets. In Brazilian Conference on Intelligent Systems (pp. 498-512). Springer, Cham.

[2] Rivolli, A., Garcia, L. P. F., Soares, C., Vanschoren, J., and de Carvalho, A. C. P. L. F. (2018). Towards Reproducible Empirical Research in Meta-Learning. arXiv:1808.10406

[3] Ho, T., and Basu, M. (2002). Complexity measures of supervised classification problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3):289-300.

[4] Lorena, A. C., Garcia, L. P. F., Lehmann, J., de Souto, M. C. P., and Ho, T. K. (2018). How Complex is your classification problem? A survey on measuring classification complexity. arXiv:1808.03591

[5] Barella, V. H., Garcia, L. P., de Souto, M. P., Lorena, A. C., and De Carvalho, A. (2018, July). Data Complexity Measures for Imbalanced Classification Tasks. In 2018 International Joint Conference on Neural Networks (IJCNN) (pp. 1-8). IEEE.