This software has been further developed by Diviya Thilakeswaran; the repository can be found here.
[Features, background, about, etc.]
The following will need to be installed before installing Chameleon:
- Python 3.6+
Clone the repository to your desired directory and simply run
$ pip install .
or, if you plan on doing development
$ pip install -e .[dev]
The basic steps in using Chameleon are:
- Import and format a dataset with chameleon-data format.
- Create k-fold cross-validation partitions with chameleon-data kfold.
- Configure the full pipeline with chameleon-pipe pipe. Alternatively, configure the pipeline in two steps, using chameleon-pipe features to add feature selection algorithms and chameleon-pipe classifiers to add classifiers.
- Run the pipeline with chameleon run.
Every feature selection method has been configured to return all features in a list ordered by importance in descending order.
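Because each method returns the full ranking, a later step can simply take the first n entries. A minimal sketch, assuming the ranking is a plain Python list of feature names; the helper `top_n_features` and the feature names are hypothetical and are not part of Chameleon. The default of 50 mirrors the documented default of the --n_features option.

```python
def top_n_features(ranked_features, n=50):
    """Return the n most important features from a ranking.

    `ranked_features` is assumed to be ordered by importance, most
    important first. The function name and the list-of-names
    representation are illustrative assumptions, not Chameleon's API.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    # Slicing past the end of a list is safe: it returns what is there.
    return list(ranked_features[:n])


# Hypothetical ranking returned by some feature selection method.
ranking = ["gene_17", "gene_3", "gene_42", "gene_8"]
top_two = top_n_features(ranking, n=2)
print(top_two)  # ['gene_17', 'gene_3']
```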
Since Chameleon is currently in the prototype stage, there are very strict requirements for data input:
- Features and targets must be in one of the following formats:
  - .mat: A single file in the same format as the biological datasets seen here.
  - .arff: A single file with the targets as the last attribute.
- All features must be continuous.
- Targets must be binary.
- No missing values.
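The requirements above can be checked before formatting a dataset. The following is a rough sketch, assuming the features are held as a list of rows and the targets as a list of labels; `check_dataset` and its helpers are hypothetical names, not Chameleon's own validation code. A numeric check can only approximate the "continuous" requirement.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def is_numeric(value):
    """Accept ints and floats, but not bools (bool is a subclass of int)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_dataset(X, y):
    """Return a list of problems found in features X and targets y."""
    problems = []
    values = [v for row in X for v in row]
    if len(X) != len(y):
        problems.append("X and y have different numbers of samples")
    if any(is_missing(v) for v in values) or any(is_missing(t) for t in y):
        problems.append("missing values present")
    if any(not is_numeric(v) for v in values if not is_missing(v)):
        problems.append("non-numeric feature present")
    # Binary targets means exactly two distinct labels.
    if len({t for t in y if not is_missing(t)}) != 2:
        problems.append("targets are not binary")
    return problems


# A small, made-up dataset that satisfies every requirement.
X = [[0.1, 2.5], [1.3, 0.7], [0.9, 1.1]]
y = [0, 1, 0]
print(check_dataset(X, y))  # []
```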
Say we have a .mat file located at /path/to/data.mat. First, we need to extract the data from this file so it is in the correct format for Chameleon.
$ chameleon-data format --ftype mat --fpath /path/to/data.mat --name foo
This command will produce two files: one for the features, X_foo.pkl, and another for the targets, y_foo.npy. They will be located in the proc_data folder of the current working directory. The folder will be created if it doesn't already exist.
Chameleon currently uses k-fold cross-validation, so we are now going to randomly partition the data into k folds to make k train/test sets.
$ chameleon-data kfold --xfile proc_data/X_foo.pkl --yfile proc_data/y_foo.npy --name myproblem
The default k is 5, but this can be changed using --k. The folds are stratified by default; stratification can be disabled using the --not-stratified flag. Similarly, the data will be normalised by default unless the --no-normalise flag is included.
This command will create a new folder in the current directory called kfold_myproblem that will contain k files named in the format fold1of*k*.pkl. Each file contains the unique training and test sets for its fold.
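To illustrate what stratified partitioning means here, the sketch below splits sample indices into k folds so that each fold keeps roughly the same class balance as the whole dataset. It is a standard-library illustration only, not Chameleon's implementation; the function name `stratified_kfold` is hypothetical, and the defaults (k=5, seed 666) simply mirror the documented defaults of --k and --random_seed.

```python
import random
from collections import defaultdict


def stratified_kfold(y, k=5, seed=666):
    """Yield (train_indices, test_indices) for each of k stratified folds."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(y):
        by_class[label].append(index)

    # Shuffle within each class, then deal indices round-robin across
    # the folds so every fold receives a similar share of each class.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    # Each fold serves once as the test set; the rest form the training set.
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train, test


# Made-up binary targets: 10 samples of class 0 and 5 of class 1.
y = [0] * 10 + [1] * 5
splits = list(stratified_kfold(y))
for train, test in splits:
    print(len(train), len(test), sum(y[i] for i in test))
```

With these targets, every test fold holds three samples, exactly one of which belongs to class 1.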
The chameleon utility provides a command line option for quickly setting up a config file for your new problem that specifies the feature selection algorithms and the classifiers that you want to run for the data files. The config file can be manually edited without issue. To start, create the config file with the default feature selection algorithms using the following command:
$ chameleon-pipe features -d kfold_myproblem
Here, -d specifies the directory containing your problem. This will create a pipeline configuration file that stores the data file names, feature selection algorithms, and classifiers. The config file is saved in the kfold_myproblem/configs directory as a JSON file named pipe.json. The file name can be specified using the --name (-n) option.
By default, all six of the feature selection algorithms will be added, but this can be overridden using the --featureselector (-f) option to specify individual algorithms. For example, if you just want to use SVM-RFE and iterative_MI, you would instead run:
$ chameleon-pipe features -d kfold_myproblem -f SVM-RFE -f iterative_MI
Currently, chameleon supports adding to the config file by running the above command multiple times when specifying an existing problem directory and config name. The only way to remove added parameters is to manually edit the config file.
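Manual removal can also be scripted. The sketch below removes one entry from a list stored in a JSON file. The layout of pipe.json is not documented here, so the key name `feature_selectors` is purely hypothetical, as is the helper `remove_entry`; inspect your own config file and adjust the key before using anything like this.

```python
import json
import os
import tempfile


def remove_entry(config_path, key, value):
    """Remove `value` from the list stored under `key` in a JSON config."""
    with open(config_path) as handle:
        config = json.load(handle)
    # Keep every entry except the one being removed.
    config[key] = [item for item in config.get(key, []) if item != value]
    with open(config_path, "w") as handle:
        json.dump(config, handle, indent=2)
    return config


# Demonstration on a throwaway file with a HYPOTHETICAL layout; the real
# pipe.json written by chameleon-pipe may use different keys.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "pipe.json")
with open(demo_path, "w") as handle:
    json.dump({"feature_selectors": ["SVM-RFE", "iterative_MI"]}, handle)

updated = remove_entry(demo_path, "feature_selectors", "SVM-RFE")
print(updated)  # {'feature_selectors': ['iterative_MI']}
```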
Add classifiers to the pipeline in the config file using the following command:
$ chameleon-pipe classifiers -p kfold_myproblem/configs/pipe.json
Here, -p specifies the relative path to the config file. By default, every available classifier will be added to the pipeline, but this can be overridden by using the --classifier (-c) option to specify individual classifiers.
Run the test suite with the config parameters with:
$ chameleon run -d kfold_myproblem -p kfold_myproblem/configs/pipe.json
The following has not been implemented and may be subject to change:
By default, each file + algorithm combination specified in the config is run one at a time, which is inefficient. This is usually fine, but could result in extreme runtimes in some cases (e.g. SVM-RFE). The --method (-m) option can improve on this, with -m slurm submitting each algorithm + classifier + file combination as a job through the Slurm batch system.
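To get a feel for how many jobs such a submission would involve, the combinations can be enumerated up front. This is only an illustration of the counting, not of how Chameleon schedules work; `enumerate_jobs` is a hypothetical helper, and the fold file names assume k = 5 with one file per fold.

```python
from itertools import product


def enumerate_jobs(fold_files, selectors, classifiers):
    """Return one (file, selector, classifier) tuple per combination."""
    return list(product(fold_files, selectors, classifiers))


# Assumed inputs: five fold files and a small selection of algorithms.
fold_files = [f"fold{i}of5.pkl" for i in range(1, 6)]
selectors = ["SVM-RFE", "iterative_MI"]
classifiers = ["kNN", "SVM", "naive-bayes"]

jobs = enumerate_jobs(fold_files, selectors, classifiers)
print(len(jobs))  # 5 files x 2 selectors x 3 classifiers = 30
```

The job count grows multiplicatively, which is why running the combinations one at a time can become slow.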
This section describes all chameleon commands, sub-commands and options.
Command | Description |
---|---|
chameleon-data | Import and prepare data for chameleon pipelines |
chameleon-pipe | Configure pipeline |
chameleon | Run configured pipeline |
There are two subcommands: chameleon-data format and chameleon-data kfold.
chameleon-data format
Required:
Option | Argument | Description |
---|---|---|
--ftype | STRING | Type of data file to format. Choices are mat and arff. |
--fpath | DIRECTORY | Path to the raw data file. |
--name | STRING | Name of output files e.g. '--name foo' will make files foo_X.pkl and foo_y.npy. |
chameleon-data kfold
Required:
Option | Argument | Description |
---|---|---|
--xfile | .PKL | The name of the X data. e.g. '--xfile proc_data/foo_X.pkl' |
--yfile | .NPY | The name of the y data. e.g. '--yfile proc_data/foo_y.npy' |
--name | STRING | Name of the output folder e.g. myproblem will create kfold_myproblem. |
Optional:
Option | Argument | Default | Description |
---|---|---|---|
--k | INT | 5 | The number of folds to partition the data into. |
--random_seed | INT | 666 | Random state for assigning data to folds. |
--normalise/--no-normalise | | True | Whether to normalise the data. |
--stratified/--not-stratified | | True | Whether to apply stratified kfold. |
There are three subcommands: chameleon-pipe pipe, chameleon-pipe features, and chameleon-pipe classifiers.
All available feature selection methods and classifiers are added to the pipeline by default. This can be overridden by specifying individual choices using the --featureselector/-f and --classifier/-c flags. Each flag can be used multiple times.
Arguments for --featureselector/-f:
Argument | Description |
---|---|
fischer | Fisher score |
reliefF | ReliefF |
random-forest | Random forest feature importance |
SVM-RFE | SVM recursive feature elimination |
simple_MI | Simple mutual information score |
iterative_MI | Iterative mutual information selection |
Arguments for --classifier/-c:
Argument | Description |
---|---|
naive-bayes | Gaussian Naive Bayes |
kNN | k-nearest neighbours |
logistic-regression | Logistic regression |
neural-net | Neural network (multilayer perceptron) |
random-forest | Random forest |
SVM | Support vector machine (linear) |
chameleon-pipe pipe
Required:
Option | Argument | Description |
---|---|---|
--data/-d | DIRECTORY | The folder containing the prepared data e.g. kfold_myproblem. |
Optional:
Option | Argument | Default | Description |
---|---|---|---|
--featureselector/-f | STRING | | Name of feature selection algorithms to add to the pipeline. |
--classifier/-c | STRING | | Name of classifier algorithms to add to the pipeline. |
--name/-n | STRING | pipe | Name for the pipeline configuration file. |
chameleon-pipe features
Required:
Option | Argument | Description |
---|---|---|
--data/-d | DIRECTORY | The folder containing the prepared data e.g. kfold_myproblem. |
Optional:
Option | Argument | Default | Description |
---|---|---|---|
--featureselector/-f | STRING | | Name of feature selection algorithms to add to the pipeline. |
--name/-n | STRING | pipe | Name for the pipeline configuration file. |
chameleon-pipe classifiers
Required:
Option | Argument | Description |
---|---|---|
--configpath/-p | .JSON | The path to the pipeline config file. |
Optional:
Option | Argument | Description |
---|---|---|
--classifier/-c | STRING | Name of classifier algorithms to add to the pipeline. |
There is one subcommand: chameleon run.
Required:
Option | Argument | Description |
---|---|---|
--data/-d | DIRECTORY | The folder containing the prepared data e.g. kfold_myproblem. |
--pipe/-p | .JSON | The relative path to the pipe config file. |
Optional:
Option | Argument | Default | Description |
---|---|---|---|
--method/-m | STRING | normal | The method for running the program. Options are normal and slurm. |
--featureselection | BOOL | True | Whether to run feature selection. |
--predict | BOOL | True | Whether to run classification. |
--n_features/-n | INT | 50 | Number of features to use in classifier predictions (the top 'n' features). |