Federated Learning for local networks (photo credit: Parth Sareen's expert prompt engineering talent)
This guide provides step-by-step instructions on how to set up and run a federated learning system with a central leader and multiple learners.
Before you begin, ensure you have the following requirements installed:
- Python 3.6+
- pip
- Access to a terminal or command line interface
Navigate to the project directory and install the required Python packages (preferably in a virtual environment):
pip install -r requirements.txt
Add your `model.py` file to the `model_artifacts` directory. This file should define the necessary variables, including `device`, `model`, `criterion`, and `optimizer_function`. Check out my ResNet-18 example model at `model_artifacts/model.py`. If you want to change the model from ResNet-18 to anything else, make sure you change it here: `model = <MODEL-DEF-GOES-HERE>`
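As a rough sketch of what `model_artifacts/model.py` needs to expose, here is a ResNet-18 setup defining the four required variables. The hyperparameters and the callable shape of `optimizer_function` are assumptions for illustration, not the repository's exact code.

```python
# model_artifacts/model.py -- minimal sketch of the four required variables.
# The ResNet-18 choice matches the example above; hyperparameters and the
# callable form of optimizer_function are assumptions, not the repo's exact code.
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Train on a GPU when one is available, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The model to train; swap this assignment to use a different architecture.
model = resnet18(num_classes=10)  # 10 classes for the default CIFAR-10 setup

# Loss function applied on each learner.
criterion = nn.CrossEntropyLoss()

# Builds a fresh optimizer from a model's parameters.
def optimizer_function(params):
    return torch.optim.SGD(params, lr=0.01, momentum=0.9)
```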
Modify the data loader in `utils.py` to suit your setup, or continue using the default setup configured for CIFAR-10.
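If you keep the default CIFAR-10 setup, the data loading in `utils.py` presumably looks something like the sketch below. The function name, batch size, and normalization constants here are illustrative assumptions.

```python
# Sketch of a CIFAR-10 data loader like the default described for utils.py.
# The function name, batch size, and normalization values are assumptions.
import torch
from torchvision import datasets, transforms

def get_data_loaders(batch_size=64, data_dir="./data"):
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
    ])
    train_set = datasets.CIFAR10(data_dir, train=True, download=True, transform=transform)
    val_set = datasets.CIFAR10(data_dir, train=False, download=True, transform=transform)
    train_loader = torch.utils.data.DataLoader(train_set, batch_size=batch_size, shuffle=True)
    val_loader = torch.utils.data.DataLoader(val_set, batch_size=batch_size, shuffle=False)
    return train_loader, val_loader
```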
Ensure all computers (leader and learners) are connected to the same network to facilitate communication.
Run the leader script specifying the number of learners to wait for:
python leader.py --learner-count X
Replace `X` with the number of learners you want (the number of computers you want to run training on).
Once the leader is running, it will display its address in the format `ADDRESS:PORT`. Use this address to start each learner:
python learner.py --leader-address ADDRESS:PORT
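For reference, the `--leader-address` flag would typically be parsed along the lines of the sketch below; the repository's actual argument handling may differ.

```python
# Sketch of how learner.py's --leader-address flag might be parsed.
# Illustrative only; the repo's actual CLI handling may differ.
import argparse

parser = argparse.ArgumentParser(description="Federated learning learner")
parser.add_argument(
    "--leader-address",
    required=True,
    help="Leader address in ADDRESS:PORT form, as printed by leader.py",
)
args = parser.parse_args()

# Split the ADDRESS:PORT string into the pieces a network client needs.
host, port = args.leader_address.rsplit(":", 1)
port = int(port)
```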
- After the specified number of learners have joined, training will start automatically.
- The system will train the model across all learners and periodically synchronize their updates with the leader (a rough sketch of this aggregation step follows this list).
- Upon completion of training, the validation accuracy will be printed and a `model.pth` file will be generated. This file contains the trained model binary.
- The training session's progress and results will be logged in the terminal.
- For troubleshooting and detailed logs, refer to the log files generated in the `logs` directory.
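The synchronization step mentioned above usually amounts to FedAvg-style weight averaging on the leader. The function below is a minimal sketch of that idea, assuming equally weighted learners; it is not the repository's actual synchronization code.

```python
# FedAvg-style aggregation sketch: average learner state_dicts element-wise.
# Assumes equally weighted learners; illustrative, not the repo's code.
import torch

def average_state_dicts(state_dicts):
    """Return the element-wise mean of several state_dicts with identical keys."""
    averaged = {}
    for key in state_dicts[0]:
        stacked = torch.stack([sd[key].float() for sd in state_dicts], dim=0)
        # Cast back to the original dtype so integer buffers (e.g. BatchNorm
        # counters) load cleanly afterwards.
        averaged[key] = stacked.mean(dim=0).to(state_dicts[0][key].dtype)
    return averaged

# Usage sketch:
# global_weights = average_state_dicts([learner_1_weights, learner_2_weights])
# model.load_state_dict(global_weights)
```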