This repository is the official implementation of Efficient Learning of Minimax Risk Classifiers in High Dimensions.
The algorithm proposed in the paper provides efficient learning of 0-1 minimax risk classifiers (MRCs) in high dimensions based on constraint generation. In particular, the iterative algorithm utilizes greedy feature selection to obtain a sequence of MRCs with decreasing worst-case error probability. Such a performance assessment at learning time provides a trade-off between training time and approximation error.
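The code in this repository implements the paper's actual formulation; purely as a generic illustration of the constraint generation idea (not the MRC 0-1 LP itself), the sketch below solves a toy linear program with many constraints by repeatedly adding the most violated one to a small working set.

```python
# Generic constraint generation on a toy LP (illustration only, not the paper's MRC LP):
# start from a small subset of constraints and iteratively add the most violated one.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n_vars, n_constr = 5, 2000
A = rng.normal(size=(n_constr, n_vars))   # full constraint matrix for A x <= b
b = rng.uniform(1.0, 2.0, size=n_constr)
c = -np.ones(n_vars)                      # maximize sum(x) == minimize -sum(x)

active = list(range(10))                  # start with a few constraints
for _ in range(100):
    res = linprog(c, A_ub=A[active], b_ub=b[active],
                  bounds=[(-10, 10)] * n_vars, method="highs")
    violation = A @ res.x - b             # check all constraints at the current solution
    worst = int(np.argmax(violation))
    if violation[worst] <= 1e-9:          # no violated constraint: optimal for the full LP
        break
    active.append(worst)                  # add the most violated constraint and re-solve

print(f"Solved using {len(active)} of {n_constr} constraints; objective = {res.fun:.4f}")
```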
The standard libraries required are listed in the file `requirements.txt`. To install these libraries using `pip`:

```
pip install -r requirements.txt
```

To create a `conda` environment with the required libraries:

```
conda create --name <environment_name> --file requirements.txt
```
The code also requires the MRCpy library, which can be installed from this repository. In addition, the implementation of the proposed algorithm utilizes the GUROBI optimizer, for which a license can be downloaded from here.
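A minimal sanity check that the GUROBI Python interface and license are set up correctly (assuming the `gurobipy` package is installed alongside the license) could look as follows:

```python
# Check that GUROBI is installed and licensed (assumes the gurobipy package is present).
import gurobipy as gp

try:
    gp.Model()  # creating a model fails if no valid license is found
    print("GUROBI is available, version:", gp.gurobi.version())
except gp.GurobiError as err:
    print("GUROBI is not correctly installed/licensed:", err)
```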
The repository contains multiple high-dimensional biological datasets to perform experiments as follows -
Dataset | Variables | Samples | Classes |
---|---|---|---|
Arcene | 10000 | 200 | 2 |
Colon | 2000 | 62 | 2 |
CLL_SUB_111 | 11340 | 111 | 3 |
Dorothea | 100000 | 1150 | 2 |
GLI_85 | 22283 | 85 | 2 |
GLIOMA | 4434 | 50 | 4 |
Leukemia | 7129 | 72 | 3 |
Lung | 12600 | 203 | 5 |
MLL | 12582 | 72 | 3 |
Ovarian | 15154 | 253 | 2 |
Prostate_GE | 5966 | 102 | 2 |
SMK_CAN_187 | 19993 | 187 | 2 |
TOX_171 | 5748 | 171 | 4 |
The datasets are available as zip files in the `Datasets` folder of this repository, along with functions in `load.py` to easily load them as numpy matrices.
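For instance, a dataset could be loaded roughly as follows; the loader name `load_dataset` below is only a placeholder, check `load.py` for the actual function names provided.

```python
# Hypothetical usage: the helper name is a placeholder, the actual loading
# functions are defined in load.py of this repository.
from load import load_dataset  # assumed name; see load.py for the real API

X, y = load_dataset('Ovarian')
print(X.shape, y.shape)  # e.g., 253 samples x 15154 features for the Ovarian dataset
```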
To reproduce the results in the paper, use the following Python scripts:
- `param_eps.py`: Experiments to study the influence of the epsilon parameter and show the monotonic decrease in the worst-case error probability.
- `param_n_max.py`: Experiment to analyze the effect of the parameter `n_max` and show the improved efficiency over MRC-LP.
- `scalability.py`: Experiments to show the scalability of the different algorithms with an increasing number of features.
- `comparison.py`: Experiments to compare our method with state-of-the-art techniques in terms of error and training time.
- `feature_selection.py`: Experiment to compare our method as a feature selection approach. The experiment is performed on binary classification datasets.
The implementation of the proposed algorithm, along with the LP formulation of MRCs, can be found in the `Main` folder. The code for SVM-CG can be found in the `Libraries` folder, which is forked from this repository.
To reproduce any of the experimental results for any dataset, run the following command -

```
python <scriptname> <dataset>
```

The result corresponding to that experiment and dataset will be saved in the `Results/` folder, in the respective subfolder of the experiment and the dataset. For instance, the experiment comparing the scalability of the different algorithms on the `Ovarian` dataset can be performed as follows -

```
python3 scalability.py Ovarian
```

and the corresponding results will be saved in `Results/Scalability/Ovarian/` as CSV files.
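Once an experiment has finished, the saved CSVs can be inspected with pandas; the file name below is only an assumption, since the actual names depend on the experiment script.

```python
# Inspect saved results (the exact CSV file names depend on the experiment script).
import pandas as pd

results = pd.read_csv('Results/Scalability/Ovarian/results.csv')  # assumed file name
print(results.head())
```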
The proposed algorithm is also implemented in the MRCpy library, which can be installed from here. After installation, one can use the proposed algorithm through the library as follows:
```python
from MRCpy import MRC

# fitting
mrc_classif = MRC(loss='0-1', solver='cg', n_max=100, k_max=20, eps=1e-4)
mrc_classif.fit(X, y)

# prediction
predicted_labels = mrc_classif.predict(X)
```
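As a self-contained follow-up, the sketch below runs the same configuration on synthetic data; the `get_upper_bound()` call assumes a recent MRCpy version that exposes the worst-case error probability obtained at learning.

```python
# Self-contained sketch on synthetic data (the get_upper_bound() call assumes
# a recent MRCpy version exposing the worst-case error probability).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from MRCpy import MRC

X, y = make_classification(n_samples=200, n_features=500, n_informative=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = MRC(loss='0-1', solver='cg', n_max=100, k_max=20, eps=1e-4)
clf.fit(X_train, y_train)

print("Worst-case error probability:", clf.get_upper_bound())
print("Test error:", np.mean(clf.predict(X_test) != y_test))
```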
For further details, please check out the documentation of the library here.
Kartheek Reddy Bondugula
Santiago Mazuelas
Aritz Pérez
If you find the code useful in your research, please include explicit mention of our work in your publication, with the following corresponding entry in your bibliography:
[1] K. Bondugula, S. Mazuelas, and A. Pérez, "Efficient Learning of Minimax Risk Classifiers in High Dimensions." The 39th Conference on Uncertainty in Artificial Intelligence. PMLR, 2023.
The corresponding BibTeX citation is given below:
```
@InProceedings{BonMazPer23,
  title     = {Efficient Learning of Minimax Risk Classifiers in High Dimensions},
  author    = {Bondugula, Kartheek and Mazuelas, Santiago and P\'{e}rez, Aritz},
  booktitle = {The 39th Conference on Uncertainty in Artificial Intelligence},
  pages     = {206--215},
  year      = {2023},
  volume    = {216},
  series    = {Proceedings of Machine Learning Research},
  publisher = {PMLR}
}
```