AIoT-MLSys-Lab/FedRolex

Question about the randomness of experiment

Status: Closed · 1 comment

Hi @samiul272, I ran your code with the following command:
python main_resnet.py --data_name CIFAR10 --model_name resnet18 --control_name 1_100_0.1_non-iid-2_fix_a1-b1-c1-d1-e1_bn_1_1 --exp_name roll_test --algo roll --g_epoch 3200 --l_epoch 1 --lr 2e-4 --schedule 1200 --seed 31 --num_experiments 3 --devices 0 1 2 3 4
I also set cfg['shuffle']['train']=False and cfg['shuffle']['test']=False. I did not change any random seeds, but when I ran the code twice, the global model reached different accuracy on the test set. The results of the first 6 rounds of both runs are shown below. Is there any way to make them consistent? Is this randomness caused by the use of the Ray framework?

Results of the first run:

Test Epoch: 1(100%) Local-Loss: 2.2380 Local-Accuracy: 45.0000 Global-Loss: 2.3141 Global-Accuracy: 16.2200
Test Epoch: 2(100%) Local-Loss: 2.1790 Local-Accuracy: 43.6000 Global-Loss: 2.3342 Global-Accuracy: 12.0700
Test Epoch: 3(100%) Local-Loss: 2.1406 Local-Accuracy: 46.8000 Global-Loss: 2.3556 Global-Accuracy: 8.7900
Test Epoch: 4(100%) Local-Loss: 2.0628 Local-Accuracy: 54.2000 Global-Loss: 2.3327 Global-Accuracy: 10.7000
Test Epoch: 5(100%) Local-Loss: 2.0630 Local-Accuracy: 49.8000 Global-Loss: 2.3717 Global-Accuracy: 10.0100
Test Epoch: 6(100%) Local-Loss: 2.0313 Local-Accuracy: 49.0000 Global-Loss: 2.3643 Global-Accuracy: 13.5500

Results of the second run:

Test Epoch: 1(100%) Local-Loss: 2.2269 Local-Accuracy: 51.7000 Global-Loss: 2.3072 Global-Accuracy: 17.6600
Test Epoch: 2(100%) Local-Loss: 2.1619 Local-Accuracy: 51.5000 Global-Loss: 2.3470 Global-Accuracy: 11.0700
Test Epoch: 3(100%) Local-Loss: 2.1262 Local-Accuracy: 48.0000 Global-Loss: 2.3466 Global-Accuracy: 9.4800
Test Epoch: 4(100%) Local-Loss: 2.0643 Local-Accuracy: 55.1000 Global-Loss: 2.3411 Global-Accuracy: 10.0100
Test Epoch: 5(100%) Local-Loss: 2.0296 Local-Accuracy: 49.6000 Global-Loss: 2.3790 Global-Accuracy: 9.6500
Test Epoch: 6(100%) Local-Loss: 1.9768 Local-Accuracy: 49.4000 Global-Loss: 2.3819 Global-Accuracy: 9.7700

Hi @samiul272, I solved this problem by resetting the random seeds in the child process of each client.
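
For anyone who hits the same issue, here is a minimal sketch of what re-seeding inside each client's worker can look like. It is not the exact change I made to the repository: `reseed`, `client_seed`, and `client_step` are hypothetical names, the seed-mixing formula is an arbitrary choice, and `client_step` only draws random numbers as a stand-in for local training. NumPy and PyTorch are seeded only if they are installed.

```python
import random


def reseed(seed: int) -> None:
    """Reset every RNG the worker might use (hypothetical helper)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored without CUDA
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


def client_seed(base_seed: int, round_idx: int, client_id: int) -> int:
    """Derive a per-client, per-round seed that fits in 32 bits.

    The mixing constants are an assumption made for this sketch.
    """
    return (base_seed * 1_000_003 + round_idx * 1_009 + client_id) % (2 ** 32)


def client_step(base_seed: int, round_idx: int, client_id: int) -> list:
    """Stand-in for one client's local update.

    The important part is that reseed() runs first, inside the
    worker, so the result does not depend on scheduling order or on
    whatever RNG state the process had before.
    """
    reseed(client_seed(base_seed, round_idx, client_id))
    return [random.random() for _ in range(3)]


if __name__ == "__main__":
    first = client_step(31, 1, 0)
    random.random()  # disturb the global RNG state between calls
    second = client_step(31, 1, 0)
    print(first == second)
```

In a real run, the call to `reseed` would go at the top of whatever function each client executes in its child process. On GPU, some operations can still be nondeterministic even with fixed seeds, so settings such as `torch.backends.cudnn.deterministic = True` may also be needed.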