Aggregating ML models rather than data has gained much attention because it boosts prediction performance while maintaining stability and preserving privacy. In a non-ideal scenario, a base model trained on a single device may make independent but complementary errors. To handle such cases, this repo implements 8 robust ML model combining methods that achieve reliable prediction results by combining numerous base models (trained on many devices) into a central model that effectively limits errors, built-in randomness, and uncertainties.
The contributions of this work can be summarized as follows:
- Studies on centralized learning, split learning, and distributed ensemble learning have extensively investigated combining models trained on devices such as smartphones, Raspberry Pis, and Jetson Nanos. Such devices have sufficient resources to train base models (or ensembles) using standard training algorithms from Python scikit-learn or lightweight ML frameworks like TensorFlow Lite. In contrast, we aim to achieve collective intelligence using MCUs, since billions of deployed IoT devices such as HVAC controllers, smart meters, and video doorbells have resource-constrained MCU-based hardware with only a few MB of memory.
- From the large number of available studies, we choose, implement, and provide 8 robust ML model combining methods that are compatible with a wide range of datasets (varying feature dimensions and classes) and IoT devices (heterogeneous hardware specifications). We open-source the implementation so that researchers and engineers can start practicing distributed ensemble learning by combining ML base models trained on ubiquitous IoT devices.
- Medical Data Usecase
- Algorithms for Combining ML Models
- Devices and Datasets for Experiments
- Experiments: Distributed Train then Combine
- Useful Books, Toolboxes and Datasets
- Classic Papers
- Source and Ranking Portals
- Reputed Data Mining Conferences/Workshops/Journals
- Doing Good Research and Get it Published
Providing sensitive medical data for research is a potential use case where combining ML models can be applied.
The data required for most research is sensitive in nature, as it revolves around private individuals. GDPR therefore restricts sending such sensitive yet valuable medical data (from hospitals and imaging centers) to research institutes. As shown in the above Fig, when resource-constrained medical devices such as insulin-delivery devices and BP apparatus are equipped with IoT hardware-friendly training algorithms like ML-MCU or Train++, they can perform onboard training of base models without depending on the hospital's local servers. After training, the base models from similar devices can be extracted, combined, and sent to research labs with improved data privacy preservation. For example, the 2 base models M71 and M72 (see above Fig) trained on ECG monitors using patients' vital data can be combined centrally, then shared for research.
To enable combining ML models rather than combining distributed data, we select, implement, and provide 8 robust methods that apply to a variety of IoT use-case data and are also suitable for combining models trained on heterogeneous IoT devices.
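To give a feel for the simpler combiners evaluated later in this README (Simple Averaging, Weighted Averaging, Maximization, Median, and Weighted Majority Vote), here is a minimal, dependency-free sketch. It is not the repo's implementation: the function names `combine_scores` and `weighted_majority_vote`, the example scores, and the weights are all hypothetical, and it assumes each base model outputs a positive-class probability per sample.

```python
from statistics import median


def combine_scores(probas, method="average", weights=None):
    """Combine positive-class scores from several base models.

    probas[m][i] is the score of base model m for sample i.
    Returns one combined score per sample.
    """
    per_sample = list(zip(*probas))  # regroup the scores sample by sample
    if method == "average":
        return [sum(s) / len(s) for s in per_sample]
    if method == "weighted_average":
        if weights is None:
            raise ValueError("weighted_average needs one weight per model")
        total = sum(weights)
        return [sum(w * p for w, p in zip(weights, s)) / total for s in per_sample]
    if method == "maximization":
        return [max(s) for s in per_sample]
    if method == "median":
        return [median(s) for s in per_sample]
    raise ValueError(f"unknown method: {method}")


def weighted_majority_vote(labels, weights):
    """labels[m][i] is the class predicted by base model m for sample i."""
    combined = []
    for votes in zip(*labels):
        tally = {}
        for w, v in zip(weights, votes):
            tally[v] = tally.get(v, 0.0) + w  # each model votes with its weight
        combined.append(max(tally, key=tally.get))
    return combined


# Hypothetical scores of three base models on four samples.
probas = [
    [0.9, 0.2, 0.6, 0.4],
    [0.8, 0.1, 0.4, 0.7],
    [0.7, 0.3, 0.45, 0.1],
]
weights = [0.6, 0.25, 0.15]  # assumed, e.g. derived from validation accuracy

avg = combine_scores(probas, "average")
labels = [[int(p > 0.5) for p in row] for row in probas]
wmv = weighted_majority_vote(labels, weights)
```

With these example values, the heavily weighted first model decides the third sample in the weighted vote, whereas an unweighted majority would side with the other two models.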
- Python 3.5, 3.6, or 3.7
- joblib
- matplotlib (optional for running examples)
- numpy>=1.13
- numba>=0.35
- pyod
- scipy>=0.19.1
- scikit_learn>=0.20
Devices: Distributed, ubiquitous IoT devices in the real world have heterogeneous hardware specifications. To replicate this scenario, the devices chosen to carry out the distributed training, listed in the table below, comprise 10 resource-constrained MCU boards (B1-B10) and 5 CPU devices (C1-C5).
Type | Device#: Name | Specs (MCUs: processor, flash, SRAM, clock in MHz; CPUs: basic specs)
---|---|---
MCU | B1: nRF52840 Feather | Cortex-M4, 1MB, 256KB, 64
MCU | B2: STM32f10 Blue Pill | Cortex-M0, 128KB, 20KB, 72
MCU | B3: Adafruit HUZZAH32 | Xtensa LX6, 4MB, 520KB, 240
MCU | B4: Raspberry Pi Pico | Cortex-M0+, 16MB, 264KB, 133
MCU | B5: ATSAMD21 Metro | Cortex-M0+, 256KB, 32KB, 48
MCU | B6: Arduino Nano 33 | Cortex-M4, 1MB, 256KB, 64
MCU | B7: Teensy 4.0 | Cortex-M7, 2MB, 1MB, 600
MCU | B8: STM32 Nucleo H7 | Cortex-M7, 2MB, 1MB, 480
MCU | B9: Feather M4 Express | Cortex-M4, 2MB, 192KB, 120
MCU | B10: Arduino Portenta | Cortex-M7+M4, 2MB, 1MB, 480
CPU | C1: W10 Laptop | Intel Core i7 @1.9GHz
CPU | C2: NVIDIA Jetson Nano | 128-core GPU @1.4GHz
CPU | C3: W10 Laptop | Intel Core i5 @1.6GHz
CPU | C4: Ubuntu Laptop | Intel Core i7 @2.4GHz
CPU | C5: Raspberry Pi 4 | Cortex-A72 @1.5GHz
Datasets: The following datasets are used for training on the above MCUs and CPUs.
- Banknote Authentication (5 features, 2 classes, 1372 samples)
- Haberman's Survival (3 features, 2 classes, 306 samples)
- Titanic (11 features, 2 classes, 1300 samples)
The training process on all 15 devices is carried out using the resource-friendly classifier training algorithm from ML-MCU.
Initially, for the Banknote dataset, once all devices complete training, 15 base models are obtained (first set). Then, each of the 8 ML model combining methods is applied in turn to this first set of models, producing 8 central models (one per combining method). The same procedure is followed for the remaining datasets, producing the second and third sets of models, followed by model combining. At this stage, there are 8 central models for each dataset, whose performance is evaluated in terms of Accuracy, ROC, and F1 score (F1) and reported in the below Fig.
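For readers who want to reproduce this kind of evaluation without extra dependencies, the three metrics can be sketched as below. This is an illustration rather than the code used for the reported results. It assumes binary labels and takes "ROC" to mean the area under the ROC curve, computed here as the fraction of positive/negative pairs that are ranked correctly. The function names and the example labels and scores are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def roc_auc(y_true, scores, positive=1):
    """Area under the ROC curve via pairwise ranking (ties count as half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute the AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical output of one central model on five test samples.
y_true = [1, 0, 1, 0, 1]
scores = [0.9, 0.3, 0.4, 0.6, 0.8]
y_pred = [int(s > 0.5) for s in scores]

acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for datasets of the sizes listed above; a rank-based formulation would be preferable for much larger test sets.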
Here, the performance of the combined central models is analyzed using the below Fig.
Banknote Authentication dataset: The highest performance is shown by the Dynamic Classifier Selection (DCS-LA) method, followed by the Maximization and then the Median combination methods, which show the same accuracy and slightly different ROC and F1. The Simple Averaging, Weighted Averaging, and Weighted Majority Vote (WMV) methods achieve similar performance. Stacking performs worst, and the Dynamic Ensemble Selection (DES) method is second worst.
Haberman's Survival dataset: Again, DCS-LA shows the top performance. The DES and Stacking methods, which performed poorly on the previous dataset, are the second and third best-performing methods. The other algebraic, averaging, and voting methods perform almost the same, achieving good accuracy and F1 but low ROC.
Titanic dataset: Stacking shows the highest accuracy, but DES achieves slightly higher ROC and F1, so DES is the overall top-performing method. Unlike the previous datasets, here the algebraic (Maximization and Median), averaging, and voting methods show varying performance. Among the algebraic methods, Median performs better. Among the averaging methods, Simple Averaging performs better.
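Since DCS-LA stands out in two of the three datasets, a rough sketch of the idea may help: for each query sample, find its nearest neighbours in a labelled validation set, measure how accurate each base model is on just those neighbours, and let the locally best model make the prediction. The code below is a simplified illustration under stated assumptions, not the repo's implementation: it uses overall local accuracy with Euclidean distance, treats base models as plain callables, and the names `dcs_la_predict`, `model_low`, `model_high`, and the toy data are all hypothetical.

```python
def dcs_la_predict(x, base_models, X_val, y_val, k=3):
    """Predict x with the base model that is most accurate near x."""

    def sq_dist(i):
        # Squared Euclidean distance between x and validation sample i.
        return sum((a - b) ** 2 for a, b in zip(x, X_val[i]))

    # Region of competence: indices of the k nearest validation samples.
    region = sorted(range(len(X_val)), key=sq_dist)[:k]

    def local_accuracy(model):
        return sum(model(X_val[i]) == y_val[i] for i in region) / len(region)

    # On ties, max() keeps the first model in the list.
    best = max(base_models, key=local_accuracy)
    return best(x)


# Two hypothetical base models, each reliable in a different feature range.
def model_low(x):
    return int(x[0] > 2)


def model_high(x):
    return int(x[0] > 7)


models = [model_low, model_high]
X_val = [[1.0], [2.5], [3.0], [4.0], [6.0], [6.5], [8.0], [9.0]]
y_val = [0, 1, 1, 1, 0, 0, 1, 1]

pred_a = dcs_la_predict([3.5], models, X_val, y_val)  # neighbours favour model_low
pred_b = dcs_la_predict([6.2], models, X_val, y_val)  # neighbours favour model_high
```

Unlike the static combiners, this approach needs a labelled validation set at combination time, which is a design cost to weigh against its stronger results here.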
The following observations were made during experimentation:
- The computational cost of creating an ensemble is not much higher than that of training a single base model, because multiple versions of the base model need to be generated during parameter tuning anyway. Also, the computational cost of combining base models trained on multiple IoT devices was small due to the simplicity of the presented combination strategies.
- To construct a good ensemble, it is recommended to create base models that are as accurate and as diverse as possible.
- Creating a learning algorithm that is consistently better than others is a hopeless daydream. For example, in the above Fig, Stacking shows the highest accuracy for the Titanic dataset and the lowest performance for the Banknote dataset.
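The recommendation on diversity can be made measurable. One common and simple choice, used here purely as an illustration (the experiments above do not state which diversity measure, if any, was used), is the mean pairwise disagreement: the fraction of samples on which two base models predict different labels, averaged over all model pairs. The function names and the example predictions are hypothetical.

```python
from itertools import combinations


def disagreement(pred_a, pred_b):
    """Fraction of samples on which two base models predict different labels."""
    return sum(a != b for a, b in zip(pred_a, pred_b)) / len(pred_a)


def mean_pairwise_disagreement(predictions):
    """Average disagreement over all pairs of base models."""
    pairs = list(combinations(predictions, 2))
    return sum(disagreement(a, b) for a, b in pairs) / len(pairs)


# Hypothetical label predictions of three base models on four samples.
preds = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
]
diversity = mean_pairwise_disagreement(preds)
```

A value of 0 means all base models always agree, so combining them cannot help; higher values indicate more diverse models, which is only useful as long as the individual models stay accurate.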
- Ensemble Methods: Foundations and Algorithms: Classical textbook covering most of the ensemble learning techniques. A must-read for people in the field.
- Ensemble Machine Learning: Methods and Applications: Responding to a shortage of literature dedicated to the topic, this volume offers comprehensive coverage of state-of-the-art ensemble learning techniques, including various contributions from researchers in leading industrial research labs.
- Applications of Supervised and Unsupervised Ensemble Methods: This book contains the extended papers presented at the 2nd Workshop on Supervised and Unsupervised Ensemble Methods and their Applications (SUEMA), in conjunction with ECAI.
- Data Mining and Knowledge Discovery Handbook Chapter 45 (Ensemble Methods for Classifiers): This chapter provides an overview of ensemble methods in classification tasks. It presents all important types of ensemble methods, including boosting and bagging. Combining methods and modeling issues such as ensemble diversity and ensemble size are discussed.
- Outlier Ensembles: An Introduction: Great intro book for ensemble learning in outlier analysis.
- combo: combo is a comprehensive Python toolbox for combining machine learning (ML) models and scores for various tasks, including classification, clustering, and anomaly detection. It supports the combination of ML models from core libraries such as scikit-learn and xgboost.
- pycobra: Python library implementing ensemble methods for regression and classification, plus visualisation tools including Voronoi tessellations.
- DESlib: A Python library for dynamic classifier and ensemble selection.
- imbalanced-learn: A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning.
As a subfield of machine learning, ensemble learning is usually tested against general machine learning benchmark datasets. Some helpful links can be found below:
- List of datasets for machine-learning research - Wikipedia
- UCI Machine Learning Repository
- PMLB: a large benchmark suite for machine learning evaluation and comparison: Dataset Repository
- Ensemble methods in machine learning @ MCS.
- Popular ensemble methods: An empirical study @ JAIR.
- Ensemble learning: A survey @ Wiley Interdisciplinary Reviews.
- Xgboost: A scalable tree boosting system @ KDD.
- Lightgbm: A highly efficient gradient boosting decision tree @ NIPS.
- CatBoost: unbiased boosting with categorical features @ NIPS.
- Cluster Ensembles – A Knowledge Reuse Framework for Combining Multiple Partitions @ JMLR.
- Clusterer Ensemble @ KBS.
- A survey of clustering ensemble algorithms @ IJPRAI.
- Clustering ensemble method @ Cybernetics.
- Outlier ensembles: position paper @ SIGKDD Explorations.
- Ensembles for unsupervised outlier detection: challenges and research questions a position paper @ SIGKDD Explorations.
- Isolation forest @ ICDM.
- Outlier detection with autoencoder ensembles @ SDM.
- An Unsupervised Boosting Strategy for Outlier Detection Ensembles @ PAKDD.
- LSCP: Locally selective combination in parallel outlier ensembles @ SDM.
- A survey on ensemble learning for data stream classification @ ACM Computing Surveys.
- Ensemble learning for data stream analysis: A survey @ Information Fusion.
- Bagging predictors @ Machine Learning.
- A decision-theoretic generalization of on-line learning and an application to boosting @ JCSS.
- Bagging, Boosting @ AAAI/IAAI.
- Stacked generalization @ Neural Networks.
- Stacked regressions @ Machine Learning.
- ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD)
- ACM International Conference on Information and Knowledge Management (CIKM)
- ACM International Conference on Web Search and Data Mining (WSDM)
- The Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD)
How to do good research, Get it published in SIGKDD and get it cited: A fantastic tutorial by Prof. Eamonn Keogh (UC Riverside)
Checklist for Revising a SIGKDD Data Mining Paper: A concise checklist by Prof. Eamonn Keogh (UC Riverside)
How to Write and Publish Research Papers for the Premier Forums in Knowledge & Data Engineering: A tutorial on how to structure data mining papers by Prof. Xindong Wu (University of Louisiana at Lafayette)