Instance Weighting through Data Imprecisiation
This repository provides the Python implementation of the paper "Instance Weighting through Data Imprecisiation" by Julian Lienen and Eyke Hüllermeier published in International Journal of Approximate Reasoning (IJAR), 2021. Please cite this work as follows:
@article{DBLP:journals/ijar/LienenH21,
author = {Julian Lienen and
Eyke H{\"{u}}llermeier},
title = {Instance weighting through data imprecisiation},
journal = {Int. J. Approx. Reason.},
volume = {134},
pages = {1--14},
year = {2021},
url = {https://doi.org/10.1016/j.ijar.2021.04.002},
doi = {10.1016/j.ijar.2021.04.002},
timestamp = {Thu, 29 Jul 2021 13:39:54 +0200},
biburl = {https://dblp.org/rec/journals/ijar/LienenH21.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
Getting started
Dependencies
The code uses the following dependencies (Python 3.7.x):
- Numpy
- Scipy
- Cvxpy (with Cvxopt solver)
- Mlflow
- Ray
- Scikit-learn
A detailed list (including version for reproducibility) can be found in requirements.txt
. To install all dependencies, run the following command:
pip install -r requirements.txt
Note that due to the use of ray, Windows systems are not supported. To run the framework on Windows-based systems, we recommend to use the Windows Linux Subsystem.
Basic workflow
This framework relies on Mlflow to track experimental results. To do so, it logs parameters
and resulting metrics for each run, such that they can be analyzed by Mlflow's UI or using SQL scripts. However, the
program itself also logs the results with the Python logging
framework, such that you can still access results without
getting familiar with Mlflow, although it is not convenient on a large scale.
Run experiments
In the original paper, two experimental settings are considered: Robust binary classification and (semi-supervised) self-training.
To run the first settings, one has to execute the following statement:
python data_imp/exp/model_exp.py <model_name> <dataset_id> <seed> <noise_level> [<debug_level> <config_path>]
The model name can be one of SVM (non-regularized), SVMReg (L2 regularized), RSVM, OWSVM or FLSVM (which is our SVM-DI). As dataset id, you can pass any OpenML ID
you want. The seed must be an integer number, while the noise level is a float value in [0,1]. A debug level (integer)
higher than 0 indicates logging debug outputs with increasing verbosity. The configuration path specifies a configuration
file (see next subchapter) relative to the project root path. If the debug level and the configuration path are not
specified, a default debug level of 0
and the default configuration file conf/run.ini
is used.
For the second experiment, a run can be started by
python data_imp/exp/sem_sup.py <model_name> <dataset_id> <seed> <unlabeled_fraction> [<debug_level> <config_path>]
Here, the model name is one of SSWSVM (weighted SVM) or SSFLSVM (our SVM-DI). The unlabeled fraction is also a float parameter between [0,1], while the other parameters match those used within the first setting.
Configuration path
For both scenarios, a configuration file path has to be specified. It uses the default Python configparser
and contains
the following entries:
[MLFLOW]
MLFLOW_TRACKING_URI = <mlflow_tracking_uri>
[EXPERIMENTS]
NUM_CORES = 4
NUM_OUTER_FOLDS = 5
NUM_INNER_FOLDS = 5
EXP_PREFIX = <mlflow_experiment_prefix_for_first_setting>
SEMSUP_EXP_PREFIX = <mlflow_experiment_prefix_for_second_setting>
If Mlflow should not be used, just comment out the property MLFLOW_TRACKING_URI
.
Parameter search spaces
To specify which parameters are tuned within the internal hyperparameter optimization, the file in
exp/search_spaces.json
can be modified. This provides a generic syntax to specify both ranges and fixed parameters.