FEDMCCS-Replication

Replicating FEDMCCS client selection in Federated Learning

The paper

The paper is available at the following link: FEDMCCS

Running

First run python3 generate_clients.py, then run runner.sh with bash or sh.

If you get exit status 137, it is likely caused by the OOM killer; check /var/log/kern.log.

Client

Dataset

To create non-IID data, we use the same method as here: we simply split the dataset into random chunks and assign one chunk to each client.
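
As a rough illustration of this chunking (a minimal sketch; the function name and the use of NumPy index arrays are assumptions, not the repository's code):

```python
import numpy as np

def split_into_random_chunks(num_samples, num_clients, seed=0):
    """Shuffle sample indices and split them into one chunk per client."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(num_samples)        # random order over all samples
    return np.array_split(indices, num_clients)   # one (roughly equal) chunk per client

# Example: split the 60 000 MNIST training samples across 11 clients.
chunks = split_into_random_chunks(60_000, 11)
```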

To replicate the effect of an ever-expanding dataset, we start with half of the dataset and increase its size by 5% in each round. This continues until the full dataset size is reached, after which the expansion stops.
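
A minimal sketch of this schedule, assuming the 5% increment is taken relative to the client's full chunk size:

```python
def dataset_size_for_round(full_size, round_num):
    """Samples used in a given round (1-indexed), assuming the 5% increment
    is relative to the full per-client dataset size."""
    size = int(full_size * (0.5 + 0.05 * (round_num - 1)))
    return min(size, full_size)   # expansion stops once the full dataset is reached

# Example: a client holding 5 000 samples uses 2 500 in round 1, 2 750 in round 2, ...
sizes = [dataset_size_for_round(5_000, r) for r in range(1, 16)]
```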

Metrics

Metrics are stored on the client as well as on the server. When the server requests training, the client spawns a watcher thread that records CPU frequency and memory usage while training is in progress. This is done via the psutil library. These metrics are averaged and stored on the client. When the server requests the data from the client, the data of the last round is returned.
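
A minimal sketch of such a watcher thread with psutil (the sampling interval, the recorded fields, and the surrounding training call are assumptions):

```python
import threading
import time
import psutil

def watch_resources(samples, stop, interval=0.5):
    """Poll CPU frequency and memory usage until told to stop."""
    proc = psutil.Process()
    while not stop.is_set():
        freq = psutil.cpu_freq()                          # may be None on some platforms
        samples.append({
            "cpu_freq_mhz": freq.current if freq else 0.0,
            "memory_mb": proc.memory_info().rss / 2**20,  # resident memory of this process
        })
        time.sleep(interval)

# Usage around a (hypothetical) training call:
samples, stop = [], threading.Event()
watcher = threading.Thread(target=watch_resources, args=(samples, stop))
watcher.start()
# train_one_round(...)   # placeholder for the actual training step
stop.set()
watcher.join()
avg_freq = sum(s["cpu_freq_mhz"] for s in samples) / max(len(samples), 1)
avg_mem = sum(s["memory_mb"] for s in samples) / max(len(samples), 1)
```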

Training

Training is done on the MNIST dataset. Validation is done on the whole MNIST test set (instead of a portion of it).
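
For reference, a small CNN of the kind typically used for MNIST might look like the sketch below; the repository only states that the model is a CNN, so this particular architecture (and the use of PyTorch) is an assumption:

```python
import torch.nn as nn

class MnistCNN(nn.Module):
    """A small CNN for 28x28 MNIST digits (illustrative architecture only)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 7 * 7, 10)

    def forward(self, x):
        x = self.features(x)                  # (N, 32, 7, 7)
        return self.classifier(x.flatten(1))  # logits for the 10 digit classes
```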

Data transmission

The server can use client.get_properties in order to obtain the properties needed for client selection.
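
On the client side, these properties can be exposed through Flower's get_properties hook. A sketch is given below; the property keys and the psutil call are assumptions about what this repository reports, not its actual code:

```python
import flwr as fl
import psutil

class ResourceReportingClient(fl.client.NumPyClient):
    """Sketch of a Flower client exposing resource metrics via get_properties."""

    def __init__(self):
        # Averages recorded by the watcher thread during the last training round.
        self.last_round_metrics = {"avg_cpu_freq": 0.0, "avg_memory_mb": 0.0,
                                   "train_time_s": 0.0, "dataset_size": 0}

    def get_properties(self, config):
        # Returned to the server when it calls client.get_properties(...)
        props = dict(self.last_round_metrics)
        props["num_cores"] = psutil.cpu_count(logical=False) or psutil.cpu_count()
        return props
```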

Proxy Meter

Another piece of software is the proxy-meter. This tool simulates bandwidth and ping limits and reports bandwidth usage. Its options are as follows:

--listen ADDRESS: The interface and port the proxy listens on.
--forward ADDRESS: The address data is forwarded to.
--ping TIME: The simulated ping (latency). Defaults to zero.
--speed NUMBER: The data transfer speed in bytes per second. Defaults to zero; zero means unlimited.
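
As a rough sketch of how such a command-line interface can be defined in Python (argparse and the help texts are illustrative assumptions; the repository's proxy-meter may parse its options differently):

```python
import argparse

def parse_args():
    """Parse proxy-meter style flags, with defaults matching the description above."""
    parser = argparse.ArgumentParser(description="Bandwidth/ping simulating proxy")
    parser.add_argument("--listen", required=True,
                        help="interface and port to listen on, e.g. 0.0.0.0:9000")
    parser.add_argument("--forward", required=True,
                        help="address to forward data to, e.g. 127.0.0.1:8080")
    parser.add_argument("--ping", type=float, default=0.0,
                        help="simulated ping (0 = no added latency)")
    parser.add_argument("--speed", type=int, default=0,
                        help="transfer speed in bytes per second (0 = unlimited)")
    return parser.parse_args()
```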

For each client, a separate proxy meter is spawned. You can turn this feature off by using the generate_compose_file_direct function in the generate_clients.py file.

Server

The server runs on 0.0.0.0:8080 on the Linux server.
Once the first client connects to the server, it starts the procedure. The overall server code is similar to the Flower base server, with changes to the client manager.
The myClientManager class inherits from fl.server.SimpleClientManager. We override the sample method defined in the parent class, which is used for client selection at the beginning of each evaluation and training round.
The method first checks whether any clients are available; if there are enough clients, it proceeds to select clients according to the criteria below.
In the first few rounds, all of the clients are given a round so that the server has enough data for the linear regression. RANDOM_ROUNDS is the variable that determines for how many rounds all clients are trained, so that a fair amount of data is collected.
After these rounds, clients are selected according to the linear regression method. The manager gets a list of all available clients and first checks their historical data, which contains five parameters:

  • Used Cores
  • Used Frequencies
  • Used Memories
  • Training Times
  • Used Dataset Sizes
The predict_utilization method uses these histories to predict the next round's cores, frequencies, memories, and training times from the dataset size that is going to be used, and clients whose predicted usage fits within the resource budgets and time threshold are selected. The number of clients to select is determined by CLIENT_FRACTION; clients that are not selected do not train in that round.
After the selection we might have a problem: fewer clients than wanted may be selected because of the budget limit. To mitigate this, if the linear regression step selects too few clients, the remainder is filled with randomly selected clients, which ensures convergence. A sketch of this selection logic is given below.
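
A minimal sketch of how this selection logic could fit together, using the names mentioned above (myClientManager, RANDOM_ROUNDS, CLIENT_FRACTION, predict_utilization). The history layout, the budget comparison, the stand-in for the next round's dataset size, and the use of scikit-learn are assumptions for illustration, not the repository's exact code:

```python
import random
from sklearn.linear_model import LinearRegression
import flwr as fl

RANDOM_ROUNDS = 5        # rounds in which every client participates
CLIENT_FRACTION = 0.5    # fraction of available clients to select per round
MEMORY_BUDGET, CPU_BUDGET, TIME_THRESHOLD = 0.8, 1.0, 20.0   # budgets from the test setup

class myClientManager(fl.server.SimpleClientManager):
    def __init__(self):
        super().__init__()
        self.round = 0
        # history[cid] = list of per-round dicts: cores, freq, memory, train_time, dataset_size
        self.history = {}

    def predict_utilization(self, cid, next_dataset_size):
        """Fit one linear regression per metric (dataset_size -> metric) and
        predict the next round's utilization."""
        rows = self.history[cid]
        x = [[r["dataset_size"]] for r in rows]
        return {
            key: LinearRegression()
                 .fit(x, [r[key] for r in rows])
                 .predict([[next_dataset_size]])[0]
            for key in ("cores", "freq", "memory", "train_time")
        }

    def sample(self, num_clients, min_num_clients=None, criterion=None):
        self.round += 1
        available = list(self.clients.values())    # all connected ClientProxy objects
        wanted = max(1, int(CLIENT_FRACTION * len(available)))

        # Warm-up rounds: let every client train so there is data to regress on.
        if self.round <= RANDOM_ROUNDS:
            return available

        eligible = []
        for client in available:
            rows = self.history.get(client.cid)
            if not rows:
                continue
            # Last recorded size used as a stand-in for the next round's dataset size.
            pred = self.predict_utilization(client.cid, rows[-1]["dataset_size"])
            # Budgets and predictions are assumed to be in comparable (normalized) units.
            if (pred["memory"] <= MEMORY_BUDGET and pred["cores"] <= CPU_BUDGET
                    and pred["train_time"] <= TIME_THRESHOLD):
                eligible.append(client)

        selected = eligible[:wanted]
        if len(selected) < wanted:                  # budget limit excluded too many clients:
            rest = [c for c in available if c not in selected]
            selected += random.sample(rest, min(wanted - len(selected), len(rest)))
        return selected
```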

Test Setup

HPC Server

  • Ubuntu 22.04
  • Linux Kernel 5.15
  • Docker 24.0.5
  • CPU: AMD EPYC 7763 64-Core Processor
  • CPU Cores: 12

Client Configurations

Client    CPU Cores  CPU Utilization  Memory Limit (MB)  Ping Latency (ms)  Bandwidth (Mbps)
Client1   1          0.01             250                100                512
Client2   1          0.01             150                100                512
Client3   1          1                200                no latency         1024
Client4   1          0.01             150                100                512
Client5   1          1                100                100                512
Client6   1          1                50                 50                 512
Client7   1          0.01             100                100                512
Client8   1          0.8              No limit           no latency         1024
Client9   1          0.01             100                100                512
Client10  1          1                100                100                512
Client11  1          0.01             100                100                512

Resource Budget

Resource            Value
Memory Budget       0.8
CPU Budget          1
Energy Budget       1
Time Threshold (s)  20

Other Parameters

Parameter                          Value
Client Fraction                    0.5
Number of Rounds                   15
Random Rounds (Before Regression)  5
Dataset                            MNIST
Model                              CNN

Test results

Round Duration

Round    FEDMCCS (s)  Random Select (s)
round1   19.2         13.5
round2   49.4         43
round3   79.6         72.2
round4   109.7        101.5
round5   139.7        130.8
round6   169.6        160
round7   191.5        189.3
round8   203.3        214.3
round9   219.3        229.9
round10  241.2        259.1
round11  356.4        373.6
round12  1329.4       1519.6
round13  2268.4       2662.6
round14  3183.4       3889.6
round15  4194.4       5120.6

Convergence

(Figures: accuracy over time for FEDMCCS and Random Select)

Time (s)  FEDMCCS accuracy  Time (s)  Random accuracy
19.2      0.33              13.5      0.25
49.4      0.48              43        0.44
79.6      0.56              72.2      0.58
109.7     0.63              101.5     0.65
139.7     0.7               130.8     0.72
169.6     0.75              160       0.76
191.5     0.79              189.3     0.8
203.3     0.81              214.3     0.82
219.3     0.83              229.9     0.84
241.2     0.84              259.1     0.85
356.4     0.85              373.6     0.85
1329.4    0.87              1519.6    0.86
2268.4    0.868             2662.6    0.87
3183.4    0.873             3889.6    0.87
4194.4    0.877             5120.6    0.87

Conclusions

This project implemented FEDMCCS, a federated learning algorithm that selects clients based on their predicted resource usage, learned from previous rounds via linear regression, while respecting resource budgets.

Advantages

  • Involving better clients
  • Learning each client's resource usage over time
  • Satisfying resource constraints

Disadvantages

  • Might give sub-optimal results
  • Fairness is not taken into account
  • If a client is not selected, there is a chance it will never be selected for the rest of the process