This repository contains the code for reproducing the results presented in the paper "MatchCurv: Communication-Efficient Decentralized Federated Learning in Heterogeneous Environments", accepted at EMERGE'24, the 1st Workshop on Enabling Machine Learning Operations for next-Gen Embedded Wireless Networked Devices, held in conjunction with the International Conference on Embedded Wireless Systems and Networks (EWSN), December 2024.

Federated learning enables privacy-preserving training of machine learning models but often relies on a centralized setup, leading to risks such as single points of failure and communication bottlenecks. MatchCurv is a decentralized federated learning framework designed to overcome these limitations by combining MATCHA, FedCurv, and a time-based deadline to improve communication efficiency while managing both statistical and system heterogeneity.
Make sure to install all the required libraries with:

```shell
pip install -r requirements.txt
```
- `_graph_presets`: This directory holds the graph presets used in experiments to simulate various random topologies. The presets are essential both for simulations and for running experiments on Raspberry Pi devices. Different graph presets represent different topologies; be sure to specify the one used in the config file.
- `_model_presets`: Similar to the graph presets, this directory contains initial model weights. Specifying a model preset is not mandatory, but you can do so in the config file if required.
- `_random_presets`: This directory contains the random presets, which replace the default pseudo-random generator. These presets are necessary for the successful execution of experiments.
- `configs`: The `configs` folder and its subfolders store configuration files for experiments. To run an experiment, create a subfolder in this directory and generate one or more JSON files following the existing examples. Each config file in a subfolder represents an experiment with its specific settings. Additionally, there is an option to define a search space: for instance, if you need to find an optimal learning rate, create a JSON file named `_search_space.json` and specify the parameters and their corresponding values to be searched.
- `datasets`: This directory contains samples from the MNIST and Fashion-MNIST datasets. Each dataset's samples are organized into individual files and are extracted during the "preheat" step. The specific dataset used in an experiment is specified in the config file.
- `results`: The results of each experiment are saved in this folder, with each experiment having its designated subfolder based on the config file.
- `src`:
  - `dashboard`: This subdirectory contains the code for the project's dashboard. It requires the database certificate for operation. To track the training progress, start a live server at the location of the HTML file and open it in a browser.
  - `matchcurv`: Inside this subdirectory, you'll find the code for the framework. Detailed descriptions of each script can be found within the individual files.
[1] Create a folder for your experiment configs:

```shell
> cd configs
> mkdir experiment1
> cd experiment1
```

[2] Create one or more config files as needed, and populate them using existing configs as a reference. You can also create a `_search_space.json` file if required:

```shell
> nano config1.json
> nano config2.json
> nano _search_space.json
```

Here's an example of a config file:
```json
{
    "is_simulation": "True",
    "ip": "",
    "tcp_port": 4000,
    "udp_port": 4001,
    "clear_logs": "True",
    "ignore_logger": "False",
    "ignore_database": "True",
    "sync_delay": 10
}
```
Set `is_simulation` to `"True"` if the code is running inside a simulation, otherwise `"False"`. Set `ip` to the device's IP address when running on Raspberry Pis; otherwise leave it empty.
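Note that the flags in the example config are stored as the strings `"True"`/`"False"` rather than JSON booleans. A minimal loading sketch (the helper `load_config` is illustrative, not part of the framework):

```python
import json

def load_config(path):
    # Illustrative helper (not part of the framework): parse a config file
    # and coerce the string flags "True"/"False" into Python booleans.
    with open(path) as f:
        config = json.load(f)
    for key, value in config.items():
        if value in ("True", "False"):
            config[key] = (value == "True")
    return config
```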
[3] Example of `_search_space.json`: in this example, we search over `curv_term` and `random_preset`. Each experiment replaces `curv_term` and `random_preset` in the config file with values specified in the search space, resulting in multiple experiments:

```json
{
    "curv_term": [
        [0.01, 0.1, 1],
        ["0_d_01", "0_d_1", "1"]
    ],
    "random_preset": [
        [
            "../../_random_presets/random_preset_1.b",
            "../../_random_presets/random_preset_2.b"
        ],
        ["preset1", "preset2"]
    ]
}
```

The total number of experiments is therefore 3 `curv_term` values times 2 `random_preset` values times 2 config files, i.e. 12. The second list after the search-space values contains the aliases used for the values when naming the results files.
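The experiment count is the Cartesian product of the search-space values times the number of config files. A quick sanity check (the `search_space` dict below is a hypothetical mirror of the example above, using the aliases as stand-in values):

```python
from itertools import product

# Hypothetical mirror of the _search_space.json example above.
search_space = {
    "curv_term": [0.01, 0.1, 1],
    "random_preset": ["preset1", "preset2"],
}
num_config_files = 2  # e.g. config1.json and config2.json

# One experiment per (curv_term, random_preset) combination per config file.
combinations = list(product(*search_space.values()))
total_experiments = len(combinations) * num_config_files
print(total_experiments)  # 3 * 2 * 2 = 12
```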
[4] Go back to the directory containing the `Simulation.py` script and edit it to specify the path to your config file(s):

```shell
> cd ../../src/matchcurv
> nano Simulation.py
```

Specify the path to your config file(s) in the `locations` list:
```python
if __name__ == "__main__":
    '''
    This script is run for starting the simulation.
    '''
    CONFIG_PATH = "../../configs/"
    RESULTS_PATH = "../../results/"
    LOGS_PATH = "../../logs/"

    locations = [
        "experiment1/"
    ]
```
[5] When everything is configured, run the experiment:

```shell
> python Simulation.py
```
Running on Raspberry Pi Devices:

Steps 1-4 are the same as above.

[5] When everything is configured, run the experiment on each device:

```shell
> python Device.py path/to/config -1 120
```

The last argument is the number of seconds to wait for the process to start (e.g., 120).
Simulation Settings:

- Set the `is_simulation` field to `true` if the program is running in simulation mode, otherwise set it to `false`.
- If the program is running on Raspberry Pi devices, you must specify the `ip` address of the device.
- Ensure that all devices across the network use the same `tcp_port` and `udp_port` settings.
```json
{
    "is_simulation": true,
    "ip": "",
    "tcp_port": 4000,
    "udp_port": 4001,
```
Dashboard-Related Settings (Not Needed):

These settings are related to the dashboard and are not required for normal operation.

```json
    "certificate": "!!!REDACTED!!!",
    "database_url": "!!!REDACTED!!!",
```
Graph Decomposition and Activation Settings:

- To perform graph decomposition into matchings, leave the `decomposition` field as `"matcha"`; this is the default value and should not be changed.
- To specify the method for computing activation probabilities, set the `activations` field to either `"random"` or `"matcha"`.
- You can control the communication budget using the `comm_budget` field, which should be set to a value between 0 and 1.

```json
    "decomposition": "matcha",
    "activations": "random",
    "comm_budget": 0.25,
```
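The exact probability computation lives in the source; as a rough illustration only of what budget-constrained random activation can look like (the function `activate_matchings` is hypothetical, not the framework's code), each matching can be activated independently with probability `comm_budget`, so on average a `comm_budget` fraction of matchings communicates per round:

```python
import random

def activate_matchings(num_matchings, comm_budget, rng=None):
    # Illustrative sketch, not the framework's actual code: activate each
    # matching independently with probability comm_budget.
    rng = rng or random.Random()
    return [m for m in range(num_matchings) if rng.random() < comm_budget]

# With "comm_budget": 0.25, roughly a quarter of the matchings are
# expected to be active in any given round.
active = activate_matchings(num_matchings=8, comm_budget=0.25,
                            rng=random.Random(0))
```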
Introducing System Heterogeneity:

- To introduce system heterogeneity, specify the percentage of straggler devices using the `stragglers` field.
- You can also specify the lag time for straggler devices using the `lag` field.
- Set the `num_seconds` field to a value greater than the lag time. In experiments, it's typically set to twice the lag time, assuming stragglers perform half the number of epochs.

```json
    "stragglers": 0.5,
    "lag": 10,
    "num_epochs": -1,
    "num_seconds": 20,
```
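The rule of thumb can be checked with a little arithmetic (illustrative only; `effective_train_seconds` is not framework code): a straggler that lags 10 s under a 20 s time-based deadline trains for half as long as a regular device.

```python
def effective_train_seconds(num_seconds, lag, is_straggler):
    # Illustrative arithmetic, not framework code: a straggler loses `lag`
    # seconds of the time-based deadline, so with num_seconds = 2 * lag it
    # trains for half as long as a regular device.
    return num_seconds - (lag if is_straggler else 0)

print(effective_train_seconds(num_seconds=20, lag=10, is_straggler=True))   # 10
print(effective_train_seconds(num_seconds=20, lag=10, is_straggler=False))  # 20
```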
If estimating `lag` and `num_seconds` is tricky, an alternative approach is provided. In this case, set `lag` to `1`, `num_seconds` to `1`, and `num_epochs` to the desired value. Then uncomment the following code section in `Device.py`:

```python
'''
Train the model.
'''
self.model.train(
    num_seconds = -1,
    num_epochs = num_epochs // (2 if self.straggler else 1),
    batch_size = self.config["batch_size"]
)
```
Model Parameters:

- Specify the model type as either `"MLP"` or `"LeNet5"` in the `model` field.
- Use the `model_args` field to specify the number of units and layers.
- Configure other model-related settings such as `num_rounds` and `num_epochs`.

`l2_term` is the L2 regularization term; `prox_term` is the constant for FedProx and is deprecated. Finally, `curv_term` is set to a non-zero value to use the FedCurv penalty.

```json
    "model": "MLP",
    "model_args": [128, 128],
    "num_rounds": 20,
    "num_epochs": 50,
    "num_seconds": -1,
    "batch_size": 128,
    "input_shape": [28, 28, 1],
    "num_outputs": 10,
    "learning_rate": 0.01,
    "l2_term": 0,
    "prox_term": 0,
    "curv_term": 0,
```
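For intuition, the FedCurv penalty adds, for every other device, a Fisher-weighted squared distance between the local weights and that device's weights, scaled by `curv_term`. A schematic sketch with plain Python lists (illustrative only; `fedcurv_penalty` is not the framework's implementation):

```python
def fedcurv_penalty(weights, neighbors, curv_term):
    # Schematic FedCurv-style penalty (illustrative, not framework code):
    # for each other device, add the squared distance between the local
    # weights and that device's weights, weighted elementwise by an
    # approximation of the diagonal of that device's Fisher information.
    penalty = 0.0
    for other_weights, fisher_diag in neighbors:
        penalty += sum(
            f * (w - w_j) ** 2
            for w, w_j, f in zip(weights, other_weights, fisher_diag)
        )
    return curv_term * penalty

# With "curv_term": 0 the penalty vanishes and training is plain local SGD.
```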
Statistical Heterogeneity Configuration:

To introduce statistical heterogeneity by distributing the dataset among devices:

- Specify the paths to the training and testing files.
- Set `label_distribution` to indicate how many labels/classes each device includes samples from.
- Adjust `sample_distribution` to specify the percentage of samples from the assigned labels distributed among devices.

```json
    "train_files": "../../datasets/mnist/train/",
    "test_files": "../../datasets/mnist/test/",
    "label_distribution": 2,
    "sample_distribution": 0.5,
```
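As an illustration of how these two knobs interact (a hypothetical sketch, not the framework's partitioning code): with `label_distribution` of 2 and `sample_distribution` of 0.5, each device draws half of the samples from two assigned classes.

```python
import random

def assign_shard(num_classes, label_distribution, sample_distribution,
                 samples_per_class, rng=None):
    # Illustrative sketch, not the framework's partitioning code: give a
    # device `label_distribution` classes, then take a `sample_distribution`
    # fraction of the sample indices for each assigned class.
    rng = rng or random.Random()
    classes = rng.sample(range(num_classes), label_distribution)
    take = int(sample_distribution * samples_per_class)
    return {c: rng.sample(range(samples_per_class), take) for c in classes}

# "label_distribution": 2, "sample_distribution": 0.5 -> each device sees
# samples from 2 of the 10 classes and half of those classes' samples.
shard = assign_shard(10, 2, 0.5, samples_per_class=100, rng=random.Random(0))
```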
Experiment Presets:

Specify the paths to the presets used in experiments:

```json
    "random_preset": "../../_random_presets/random_preset_2.b",
    "graph_preset": "../../_graph_presets/sparse_graph.b",
    "model_preset": "../../_model_presets/mlp_128_128_preset_1.b",
```
Other Parameters:

- `logs_location`: specify the location for storing logs.
- `results_location`: specify the location for storing results.
- `stdout`: set to `"False"` to prevent program outputs from flooding the screen during simulation; for Raspberry Pi deployment, you can set it to `"True"`.
- `clear_logs`: set to `"True"` to clear existing logs.
- `ignore_logger`: set to `"False"` to enable logging.
- `ignore_database`: set to `"True"` to ignore database-related operations.
- `sync_delay`: the delay (in seconds) that `Simulation.py` waits for all subprocesses to spawn before starting training. Adjust as needed based on the number of processes.

```json
    "logs_location": "../../logs/final/",
    "results_location": "../../results/final",
    "stdout": "False",
    "clear_logs": "True",
    "ignore_logger": "False",
    "ignore_database": "True",
    "sync_delay": 10
}
```