TL-System/plato

[BUG] Asynchronous mode does not proceed in the initial round


Describe the bug
When using asynchronous mode, the server is unable to proceed past the initial round: it waits forever for client reports that never arrive.

To Reproduce

Use the following configuration file with the ./run -c command:

clients:
    # Type
    type: simple

    # The total number of clients
    total_clients: 500

    # The number of clients selected in each round
    per_round: 50

    # Should the clients compute test accuracy locally?
    do_test: false

    # Whether client heterogeneity should be simulated
    speed_simulation: true

    # The distribution of client speeds
    simulation_distribution:
        distribution: pareto
        alpha: 1

    # The maximum amount of time for clients to sleep after each epoch
    max_sleep_time: 30

    # Should clients really go to sleep, or should we just simulate the sleep times?
    sleep_simulation: false

    # If we are simulating client training times, what is the average training time?
    avg_training_time: 20

    random_seed: 1

server:
    address: 127.0.0.1
    port: 8000

    ping_timeout: 36000
    ping_interval: 36000

    # Should we operate in synchronous mode?
    synchronous: false

    # Should we simulate the wall-clock time on the server? Useful if max_concurrency is specified
    simulate_wall_time: true

    # What is the minimum number of clients that need to report before aggregation begins?
    minimum_clients_aggregated: 15

    # What is the staleness bound, beyond which the server should wait for stale clients?
    staleness_bound: 10

    # Should we send urgent notifications to stale clients beyond the staleness bound?
    request_update: false

    random_seed: 1

data:
    # The training and testing dataset
    datasource: MNIST 

    # Number of samples in each partition
    partition_size: 600 

    # IID or non-IID?
    sampler: noniid

    # The concentration parameter for the Dirichlet distribution
    concentration: 0.5

    # The random seed for sampling data
    random_seed: 1

trainer:
    # The type of the trainer
    type: basic 

    # The maximum number of training rounds
    rounds: 5

    # The maximum number of clients running concurrently
    max_concurrency: 3

    # Number of epochs for local training in each communication round
    epochs: 1
    batch_size: 32
    optimizer: SGD
    learning_rate: 0.01
    momentum: 0.9
    weight_decay: 0.0

    # The machine learning model
    model_name: lenet5 

algorithm:
    # Aggregation algorithm
    type: fedavg
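
For reference, assuming the configuration above is saved as async_mnist.yml (the filename here is arbitrary), the reproduction command would be:

    ./run -c async_mnist.yml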

Just want to add that, if running on CPU, the configuration above works as long as the number of clients selected per round (the per_round attribute under clients) is less than or equal to max_concurrency. If running on GPU, the number of clients per round should be less than the number of GPUs used for training. A sketch of the CPU workaround follows below.
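
For example, lowering per_round to match the max_concurrency value in the configuration above avoids the hang on CPU (the value 3 is chosen only to satisfy per_round <= max_concurrency; it is not otherwise special):

clients:
    # Select no more clients per round than can run concurrently (CPU workaround)
    per_round: 3

trainer:
    # Unchanged from the configuration above
    max_concurrency: 3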

Thanks.