Federated-Learning


Federated learning over WiFi: Should we use TCP or UDP?

Explore the repository»
View Paper

tags : distributed optimization, large-scale machine learning, edge learning, federated learning, deep learning, pytorch

About The Project

Federated learning is a distributed learning paradigm where a centralized model is trained on data distributed over a large number of clients, each with unreliable and relatively slow network connections. The client connections typically have limited bandwidth available to them when using networks such as 2G, 3G, or WiFi. As a result, communication often becomes a bottleneck. Currently, the communication between the clients and the server is mostly based on the TCP protocol. In this paper, we explore using the UDP protocol for the communication between the clients and the server. In particular, we present UDP-based aggregation algorithms for federated learning. We propose FedGradUDP for gradient aggregation-based federated learning and FedAvgUDP for model aggregation-based federated learning. We also present a scalable framework for practical federated learning. We empirically evaluate the performance of FedGradUDP by training a deep convolutional neural network on the MNIST dataset, and of FedAvgUDP by training the VGG16 deep convolutional neural network on the CIFAR10 dataset. We conduct experiments over WiFi and observe that the UDP-based protocols can lead to faster convergence than the TCP-based protocol, especially in bad networks.

Built With

This project was built with

  • Python v3.7.6
  • PyTorch v1.7.1
  • The environment used for developing this project is available at environment.yml.

Framework

The framework is developed in Python using the PyTorch library for learning and the socket library for communication.

TCP The central server is set up with a TCP socket and runs indefinitely. Whenever a client is ready to communicate, a connection is established between the client and the server. A new thread is created at the server for each incoming client, and the data is then transmitted. The tensors (gradients/weights) are serialized before sending and deserialized after receiving. The server aggregates all the updates and sends the result back to the clients, after which the connection is closed.
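As a rough sketch, the server side of this loop could look like the following. This is illustrative, not the project's actual code: handle_client and aggregate are hypothetical names, the address, port, and buffer size are assumptions, and a trivial aggregation stands in for the real one.

import pickle
import socket
import threading

def aggregate(update):
    # Placeholder: the real server would combine updates across all clients
    return update

def handle_client(conn):
    # Receive the serialized tensor; assumes the client half-closes its
    # write side (shutdown(SHUT_WR)) after sending, so recv() returns b"".
    data = b""
    while True:
        chunk = conn.recv(4096)
        if not chunk:
            break
        data += chunk
    update = pickle.loads(data)                    # deserialize gradients/weights
    conn.sendall(pickle.dumps(aggregate(update)))  # send the aggregate back
    conn.close()                                   # close the connection

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("0.0.0.0", 29500))   # illustrative address and port
server.listen()
while True:                       # the server runs indefinitely
    conn, _ = server.accept()
    threading.Thread(target=handle_client, args=(conn,)).start()  # one thread per client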

UDP The central server is set up with a UDP socket and runs indefinitely. We use two special messages, start-of-transmission and end-of-transmission, to signal the beginning and end of a transmission. We send these special messages via TCP to ensure their delivery. This is necessary because, unlike TCP, UDP has no connection semantics, so the end of a transmission must be explicitly signaled to the server. Whenever a client is ready to communicate, it sends a start-of-transmission message to the server. A new thread is created at the server for each incoming client, and each client is assigned a dedicated UDP port. The data transmission takes place through the assigned port alone, in parallel with the other transmissions. The tensors (gradients/weights) are divided into smaller subvectors so that each fits within the maximum UDP datagram size. The starting index of each subvector is prepended to the subvector to preserve the ordering of packets. The subvectors are serialized before sending and deserialized after receiving. The server aggregates all the updates and sends the result back to the clients through the designated ports.
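The client-side chunking could be sketched as follows, assuming pickle for serialization; send_tensor_udp, the chunk size, and the address are illustrative names and values, not the project's identifiers.

import pickle
import socket
import torch

CHUNK = 8192  # elements per datagram (illustrative); keeps each serialized
              # packet well under the ~64 KB UDP datagram limit

def send_tensor_udp(sock, tensor, addr):
    flat = tensor.flatten()
    for start in range(0, flat.numel(), CHUNK):
        # Clone so only this slice's storage is pickled, and prepend the
        # starting index so the receiver can order the packets
        subvector = flat[start:start + CHUNK].clone()
        sock.sendto(pickle.dumps((start, subvector)), addr)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_tensor_udp(sock, torch.randn(100000), ("127.0.0.1", 29501))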

Getting Started

Clone the repository onto your local machine using,

git clone https://github.com/vineeths96/Federated-Learning
cd Federated-Learning/

Prerequisites

Create a new conda environment and install all the libraries by running the following command

conda env create -f environment.yml

The dataset used in this project (CIFAR10) will be automatically downloaded and set up in the data directory during execution.
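Under the hood, this typically amounts to a torchvision call like the following (a sketch; the project's exact root path and transforms may differ).

from torchvision import datasets

# download=True fetches CIFAR10 into ./data on the first run
train_set = datasets.CIFAR10(root="data", train=True, download=True)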

Instructions to run

The server has to be started before federated training begins, and it should be launched with the expected number of clients so that it can connect to and serve all of them.

To launch the server for distributed training,

 python trainer_server.py --world_size <num_clients>

To launch training on a client with a single worker (GPU),

 python trainer.py --local_rank <client_rank> --world_size <num_clients>
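
For example, to run federated training with two clients (assuming client ranks are 0-indexed),

 python trainer_server.py --world_size 2
 python trainer.py --local_rank 0 --world_size 2
 python trainer.py --local_rank 1 --world_size 2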

Results

We highly recommend reading through the paper before proceeding to this section. The paper explains the UDP-based algorithms we propose and contains many more analyses and results than what is presented here.

We begin with an explanation of the notation used in the plot legends in this section. FedGradTCP corresponds to the default gradient aggregation-based federated learning using TCP. FedGradUDP corresponds to the FedGradUDP algorithm. FedAvgTCP corresponds to the default model aggregation-based federated learning using TCP. FedAvgUDP corresponds to the FedAvgUDP algorithm.

[Figures: loss curves, accuracy curves, and time breakdowns for FedGradUDP and FedAvgUDP; see the repository for the plots.]

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Vineeth S - vs96codes@gmail.com

Project Link: https://github.com/vineeths96/Federated-Learning