/CIAMS

A toolkit for automatically selecting a suitable classification model along with an assessment of how hard or easy to learn a classifier on the given dataset.

Primary LanguagePythonBSD 3-Clause "New" or "Revised" LicenseBSD-3-Clause

CIAMS - Clustering Indices based Automatic classification Model Selection

The code for CIAMS is packaged under the name AutoMS.

AutoMS (Automatic Model Selection Using Cluster Indices) is a machine learning model recommendation and dataset classifiability assessment toolkit.

Find the documentation here.

Table of Contents

Overview

AutoMS estimates the maximum achievable f1 scores corresponding to various classifier models for a given binary classification dataset. These estimated scores help make informed choices about the classifier models to experiment on the dataset, and also speculate what to expect from each of them. AutoMS also predicts the classification complexity of the dataset which characterizes the ease with which the dataset can be classified.

AutoMS extracts clustering-based metafeatures from the dataset and uses fitted classification and regression models to predict the classification complexity and estimate the maximum achievable f1-scores corresponding to various classifier models for the dataset.

Note: f1-score in all discussions pertaining to AutoMS refers to a variant of weighted average f1-score for binary datasets from class imbalance learning literature that weights the f1-scores of classes inversely proportional to their proportions in the dataset.

where, R is the class imbalance ratio, which is the fraction of number of samples in the majority class to the number of samples in the minority class.

Installing AutoMS

We recommend installing automs into a virtual environment.

$ sudo pip install virtualenv
$ virtualenv --python=python3.6 automs-venv
$ source automs-venv/bin/activate
$ pip install automs

Tip: If you encounter errors in installing AutoMS, install python3.6-dev system package (which contains the header files and static library for Python) and, then attempt installing automs again.

$ sudo apt-get install python3.6-dev
$ pip install automs

Configuring AutoMS

The default configurations with which to run automs can be configured using the AutoMS Configuration Wizard with:

$ automs-config

The configured defaults can be overriden for each invocation of automs by suppling appropriate arguments to the command-line or python interface.

Running AutoMS on a dataset

Step 1: Downloading the dataset

Download a binary classification dataset of choice (in csv, libsvm or arff format) from the web. In this illustration, we will be using the Connectionist Bench (Sonar, Mines vs. Rocks) Data Set. Download the dataset in csv format from here with:

$ wget https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data 

Change the current working directory to the directory into which the dataset was downloaded. Rename the dataset file to have a '.csv' extension.

$ mv sonar.all-data sonar.csv

Note: AutoMS infers the data format of a dataset file from its filename extension. Therefore, you must rename the dataset file to have a filename extension that corresponds to its data format. Supported filename extensions (and data formats) are '.csv', '.libsvm' and '.arff'.

Step 2: Creating the dataset configuration file

The configuration file for the dataset encodes information about the structure of the dataset file.

Create a dataset configuration file for the dataset in the same directory as the dataset file, with filename same as the dataset filename suffixed with a '.config.py' extension (i.e., in this case sonar.csv.config.py).

$ echo -e "from automs.config import CsvConfig\nconfig = CsvConfig()" > sonar.csv.config.py
$ cat sonar.csv.config.py

For examples of the configuration file content corresponding to variety of dataset files, refer to the examples section in documentation.

Note: For the dataset file sonar.csv, the contents of the dataset configuration file sonar.csv.config.py is:

from automs.config import CsvConfig
config = CsvConfig()

Since, the dataset file in this case is aligned with the default values of the arguments to CsvConfig class, no arguments have been explicitly passed to CsvConfig class in the creation of the config object. However, you may need to override some of the default values of the arguments to your data format specific dataset configuration class in the creation of the config object, to suit to your dataset file.

For information about the dataset configuration classes corresponding to the various data formats and the arguments they accept, refer to API documentation of Dataset Configuration Classes.

Step 3: Predicting Classification Complexity and Estimating F1 scores for the dataset

Command-line Interface

$ automs sonar.csv --oneshot --truef1 --result sonar_results

For the more information about the oneshot and subsampling approaches, refers to What are the oneshot and sub-sampling appeoaches ? and When should I use the oneshot and sub-sampling approaches ? in the FAQ section in documentation.

The predicted classification complexity, estimated f1-score and true f1-score results for the dataset should be available in the sonar_results file after the completion of execution of the program.

$ cat sonar_results

Note: The predicted classification complexity boolean value indicates if the dataset can be classified with a f1-score > 0.6 using any of the classification methods. True indicates that the dataset is hard to to classify and False indicates that the dataset is easy to classify.

The estimated f1-scores corresponding to various classifier models should help identify the candidate top performing classification methods for the dataset, and help reduce the search space of classification algorithms to be experimented on the dataset.

For more information about the AutoMS command line interface and the arguments it accepts, refer to API Documentation for AutoMS command line interface.

$ automs --help

Python Interface

>>> from automs.automs import automs
>>> is_hard_to_classify, estimated_f1_scores, true_f1_scores = automs('sonar.csv', oneshot=True, return_true_f1s=True)
>>> print(f"IS HARD TO CLASSIFY = {is_hard_to_classify}")
>>> print(f"Estimated F1-scores = {estimated_f1_scores}")
>>> print(f"True F1-scores = {true_f1_scores}")

For more information about the AutoMS python interface and the arguments it accepts, refer to API Documentation for AutoMS python interface.

>>> from automs.automs import automs
>>> help(automs)

Tip: Inspect the configured (or specified) warehouse sub-directory corresponding to the last run of AutoMS for result files results.xlsx, predicted_classification_complexity, estimated_f1_scores and true_f1_scores, and the intermediate data subsample files in its bags/ sub-directory.

$ ls <Path to configured AutoMS warehouse>
$ cd <Path to configured AutoMS warehouse>/sonar.csv/
$ tail -n +1 predicted_classification_complexity estimated_f1_scores true_f1_scores
$ xdg-open results.xlsx

Documentation

The AutoMS documentation is hosted at https://automs.readthedocs.io/.

Authors

Acknowledgments