The dcme package provides functions to compute data complexity measures.
dcme is under development and not yet available on CRAN. You can install the development version using the remotes package as follows:
# install.packages("remotes")
remotes::install_github("fdavidcl/dcme")
The following complexity measures are currently implemented:
num_examples: Number of Observations
num_features: Number of Features
num_features_numeric: Number of Numeric Features
num_features_binary: Number of Binary Features
num_features_categorical: Number of Categorical Features
num_classes: Number of Classes
proportion_features_numeric: Proportion of Numeric Features
proportion_features_binary: Proportion of Binary Features
proportion_features_categorical: Proportion of Categorical Features
sd_ratio: Geometric Mean Ratio of Standard Deviations
corr_abs: Mean Absolute Correlation Coefficient
num_examples_majority: Number of Observations in the Majority Class
num_examples_minority: Number of Observations in the Minority Class
proportion_examples_majority: Proportion of Majority Examples
proportion_examples_minority: Proportion of Minority Examples
IR: Imbalance Ratio (binary)
C1: Entropy of Class Proportions
C2: Imbalance Ratio (multiclass)
num_examples_majority, num_examples_minority, proportion_examples_majority, proportion_examples_minority, and IR are defined only for binary data sets.
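To make the class imbalance measures above more concrete, here is a minimal base R sketch of an imbalance ratio and an entropy of class proportions. The helper names `imbalance_ratio` and `class_entropy` are hypothetical and are not part of the dcme API. The definitions are assumptions: the ratio is taken as majority count over minority count, and the entropy is normalized by the logarithm of the number of classes. The package may define both differently.

```r
# Hypothetical helpers for illustration only; not the dcme implementation.

# Assumed definition: majority class count divided by minority class count.
imbalance_ratio <- function(y) {
  counts <- table(y)
  counts <- counts[counts > 0]        # ignore unused factor levels
  stopifnot(length(counts) == 2)      # only meaningful for binary labels
  as.numeric(max(counts) / min(counts))
}

# Assumed definition: Shannon entropy of the class proportions,
# normalized to [0, 1] by the log of the number of observed classes.
class_entropy <- function(y) {
  counts <- table(y)
  counts <- counts[counts > 0]
  if (length(counts) < 2) return(0)   # a single class carries no entropy
  p <- as.numeric(counts) / sum(counts)
  -sum(p * log(p)) / log(length(p))
}

y <- c(rep("a", 6), rep("b", 2))
imbalance_ratio(y)
class_entropy(y)
```

Under these assumptions, a perfectly balanced label vector gives a ratio of 1 and a normalized entropy of 1, while stronger imbalance raises the ratio and lowers the entropy.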
F1: Fisher's Discriminant Ratio
F2: Volume of Overlap Region
F3: Maximum Individual Feature Efficiency
These measures are currently implemented only for binary data sets.
Not implemented yet: F4 (Collective Feature Efficiency)
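As an illustration of the feature overlap measures, the following base R sketch computes Fisher's discriminant ratio for a binary data set in the form given by Ho and Basu [2]: for each feature, the squared difference of the class means divided by the sum of the class variances, with the maximum over features reported. The function name `fisher_ratio` is hypothetical, and this sketch is not necessarily the variant that dcme implements.

```r
# Hypothetical helper for illustration only; not the dcme implementation.
fisher_ratio <- function(x, y) {
  x <- as.matrix(x)
  classes <- unique(y)
  stopifnot(length(classes) == 2)     # binary data sets only

  a <- x[y == classes[1], , drop = FALSE]
  b <- x[y == classes[2], , drop = FALSE]

  # Per-feature ratio: (mean_a - mean_b)^2 / (var_a + var_b)
  ratios <- (colMeans(a) - colMeans(b))^2 /
    (apply(a, 2, var) + apply(b, 2, var))

  # The most discriminative feature determines the measure
  max(ratios)
}

x <- cbind(c(0, 2, 10, 12), c(1, 3, 2, 4))
y <- c("a", "a", "b", "b")
fisher_ratio(x, y)
```

A large value means that at least one feature separates the two classes well on its own. A feature with zero variance in both classes would produce a division by zero, which this sketch does not handle.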
N2: Ratio of Average Intra/Inter Class 1-NN Distance
N3: Error Rate of 1-NN Classifier
Not implemented yet: N1 (Fraction of Borderline Points)
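The error rate of a 1-NN classifier can be sketched in base R with a leave-one-out estimate: each observation is classified by the label of its nearest other observation. The function name `loo_1nn_error` is hypothetical, and the use of Euclidean distance and leave-one-out estimation are assumptions; dcme may compute N3 differently.

```r
# Hypothetical helper for illustration only; not the dcme implementation.
loo_1nn_error <- function(x, y) {
  # Pairwise Euclidean distances between all observations
  d <- as.matrix(dist(as.matrix(x)))
  diag(d) <- Inf                      # an observation may not be its own neighbour

  # Index of the nearest other observation for each row
  nearest <- apply(d, 1, which.min)

  # Fraction of observations whose nearest neighbour has a different label
  mean(y[nearest] != y)
}

x <- matrix(c(0, 0.1, 5, 5.1), ncol = 1)
loo_1nn_error(x, c("a", "a", "b", "b"))
loo_1nn_error(x, c("a", "b", "a", "b"))
```

The distance matrix grows quadratically with the number of observations, so this sketch suits small data sets only.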
N4: Nonlinearity of the 1-NN Classifier
Not implemented yet: T1 (Fraction of Hyperspheres Covering Data)
T2: Average Number of Points per Dimension
T3: Average Number of Points per PCA Dimension
T4: Ratio Between PCA Dimension and Original Dimension
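The three dimensionality measures above can be sketched together in base R. The function name `dimensionality_sketch` and its `threshold` parameter are hypothetical, and taking the PCA dimension as the number of principal components needed to explain 95% of the variance is an assumption; the threshold used by dcme is not stated here.

```r
# Hypothetical helper for illustration only; not the dcme implementation.
dimensionality_sketch <- function(x, threshold = 0.95) {
  x <- as.matrix(x)
  n <- nrow(x)                        # number of observations
  d <- ncol(x)                        # number of original features

  # Cumulative share of variance explained by the principal components
  pca <- prcomp(x, center = TRUE, scale. = FALSE)
  explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

  # Assumed PCA dimension: first component count reaching the threshold
  d_pca <- which(explained >= threshold)[1]

  c(T2 = n / d, T3 = n / d_pca, T4 = d_pca / d)
}

# Two perfectly correlated columns collapse to one PCA dimension
dimensionality_sketch(cbind(1:10, 2 * (1:10)))
```

With redundant features, T3 exceeds T2 and T4 drops below 1, since fewer PCA dimensions than original features are needed.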
Not implemented yet: Density, ClsCoef, Hubs
Not implemented yet: L1 (Sum of the Error Distance by Linear Programming), L2 (Error Rate of Linear Classifier), L3 (Non-Linearity of a Linear Classifier)
Definitions and explanations of most functions implemented in the dcme package can be found in the following literature:
[1] Michie, D., Spiegelhalter, D. J., & Taylor, C. C. (1994). Machine learning, neural and statistical classification.
[2] Ho, T. K., & Basu, M. (2002). Complexity measures of supervised classification problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3), 289-300.