Simulates DDP (Distributed Data Parallel) training in the presence of malicious workers that inject trivial noise into their local gradients at every iteration.
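As a rough illustration of the setup, the sketch below averages per-worker gradients after the faulty workers have perturbed theirs. It is a pure-Python toy, not this repository's code: the function names, the zero-mean Gaussian noise model, and the noise_std parameter are all assumptions made for illustration.

```python
import random

def inject_noise(grad, noise_std, rng):
    """Return a copy of `grad` with zero-mean Gaussian noise added to each entry."""
    return [g + rng.gauss(0.0, noise_std) for g in grad]

def aggregate(worker_grads, faulty, noise_std=0.1, seed=0):
    """Average the workers' gradients; workers listed in `faulty` report noisy ones."""
    rng = random.Random(seed)
    faulty = set(faulty)
    reported = [
        inject_noise(g, noise_std, rng) if i in faulty else list(g)
        for i, g in enumerate(worker_grads)
    ]
    n = len(reported)
    # Element-wise mean across workers, as in plain data-parallel averaging.
    return [sum(col) / n for col in zip(*reported)]

# 4 workers that all computed the same gradient; worker 0 is faulty.
grads = [[1.0, -2.0, 0.5] for _ in range(4)]
clean = aggregate(grads, faulty=[])
noisy = aggregate(grads, faulty=[0])
```

With no faulty workers the average equals the shared gradient; with worker 0 faulty, the average is shifted by a quarter of the injected noise, which is the kind of deviation a defense or correction step would have to deal with.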
pip install -r requirements.txt
python main.py -h
usage: main.py [-h] [-d] [--arch] [-e] [-gb] [-w] [-f [...]] [-df] [--proc] [-tb] [--device] [--correct] [--data_coll_path]
DDP simulation
optional arguments:
-h, --help show this help message and exit
-d , --dataset Dataset to use
--arch Model to use
-e , --epoch Number of epochs
-gb , --global_batch_size
Global batch size
-w , --worker Number of workers/sub-batches (Note: global batch size must be divisible by number of workers)
-f [ ...], --faulty [ ...]
Indices of faulty workers (Ex: -f 0 1 2)
-df , --defense Defense method
--proc Number of processes
-tb , --tb Tensorboard log directory
--device Device to use
--correct Whether to use error correction
--data_coll_path Path to data collection
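The divisibility note on -w can be pictured with a small sketch: one global batch is cut into equally sized sub-batches, one per worker. The helper name split_global_batch is hypothetical and is not taken from main.py; the values 512 and 16 match the example run shown further down.

```python
def split_global_batch(samples, num_workers):
    """Split one global batch into equally sized sub-batches, one per worker."""
    if len(samples) % num_workers != 0:
        raise ValueError(
            f"global batch size {len(samples)} is not divisible by {num_workers} workers"
        )
    size = len(samples) // num_workers
    return [samples[i * size:(i + 1) * size] for i in range(num_workers)]

# 512 samples split across 16 workers gives 32 samples per worker.
sub_batches = split_global_batch(list(range(512)), 16)
print(len(sub_batches), len(sub_batches[0]))  # 16 32
```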
Example: run DDP training on the MNIST dataset with 16 workers, 5 of which are malicious (workers 0 to 4), with error correction enabled:
python main.py -d mnist -w 16 -f 0 1 2 3 4 --correct
Device: cpu
Number of workers: 16
Number of processes: 1
Faulty worker idxs: [0, 1, 2, 3, 4]
Namespace(dataset='mnist', arch=None, epoch=3, global_batch_size=512, worker=16, faulty=[0, 1, 2, 3, 4], defense=None, proc=1, tb=None, device='cpu', correct=True, data_coll_path=None)
Worker-to-dataBatch assignments
W-B MAP: {0: 7, 1: 7, 2: 7, 3: 7, 4: 7, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15}
[AGG] Init correction model for worker 0
[AGG] Init correction model for worker 1
[AGG] Init correction model for worker 2
[AGG] Init correction model for worker 3
[AGG] Init correction model for worker 4
[CORRECTION] Training ...
[CORRECTION] Training loss (MSE, MAPE): 6.644912e-07 0.0004825247
[CORRECTION] Training ...
[CORRECTION] Training loss (MSE, MAPE): 6.644768e-07 0.00048268036
[CORRECTION] Training ...
[CORRECTION] Training loss (MSE, MAPE): 6.644902e-07 0.0004825253
[CORRECTION] Training ...
[CORRECTION] Training loss (MSE, MAPE): 6.6449115e-07 0.00048252486
[CORRECTION] Training ...
[CORRECTION] Training loss (MSE, MAPE): 6.644911e-07 0.00048252384
EP: 1/3, sub-batch: 1/118, avg sub-batch loss: 2.295
W-B MAP: {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15}
EP: 1/3, sub-batch: 2/118, avg sub-batch loss: 2.312
W-B MAP: {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15}
EP: 1/3, sub-batch: 3/118, avg sub-batch loss: 2.306
W-B MAP: {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15}
EP: 1/3, sub-batch: 4/118, avg sub-batch loss: 2.308
W-B MAP: {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15}
EP: 1/3, sub-batch: 5/118, avg sub-batch loss: 2.298
W-B MAP: {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 10: 10, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15}
...
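The counter "x/118" in the log above is consistent with the arithmetic below, assuming it counts global batches per epoch over the 60,000-image MNIST training split and that the final partial batch is kept. Both points are assumptions read off the log, not confirmed from main.py.

```python
import math

MNIST_TRAIN_SIZE = 60_000   # size of the standard MNIST training split
GLOBAL_BATCH_SIZE = 512     # global_batch_size in the Namespace line of the log
NUM_WORKERS = 16            # worker in the Namespace line of the log

# Global batches needed to cover the training split once (partial batch kept).
iterations_per_epoch = math.ceil(MNIST_TRAIN_SIZE / GLOBAL_BATCH_SIZE)
# Samples each worker receives from one full global batch.
sub_batch_size = GLOBAL_BATCH_SIZE // NUM_WORKERS
# Samples left over for the final, partial global batch.
last_batch_size = MNIST_TRAIN_SIZE - (iterations_per_epoch - 1) * GLOBAL_BATCH_SIZE

print(iterations_per_epoch, sub_batch_size, last_batch_size)  # 118 32 96
```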
tensorboard --logdir=runs