Researcher component needs better memory management
Closed this issue · 2 comments
While testing big neural network models over many training rounds I have encountered memory issues. Please see details below.
Model size: 600MB
Number of rounds of training: 100
Number of nodes: 3
Dry run: True
Operating System: Mac M3
Tested: using Pytest end-2-end machinery, Jupyter Notebook, and using plain python3.
After reaching round 22, the memory usage of the program (researcher) goes over 32 GB, which ends with the following error when using python or the pytest end-to-end facility:
envs/fedbiomed-researcher/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
On Jupyter Notebook, the kernel dies after reaching round 22.
This means that the researcher component can only handle roughly 200 rounds for an average 60MB model, or 2000 rounds for a 6MB model. This is due to training_replies, which keeps all the previous aggregated and individual model weights across the rounds of training. It also means that the researcher component can handle fewer rounds as the number of nodes increases, because more model weights are kept in the training replies object. Additionally, the number may vary depending on whether secure aggregation is activated, since it may increase the volume of the individual encrypted model weights.
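The scaling described above can be checked with a back-of-the-envelope estimate. This is only a sketch: the function name and the assumption that each round retains one copy per node plus one aggregated copy are mine, and it ignores Python object overhead, serialization buffers, and any extra volume added by secure aggregation.

```python
def estimate_retained_mb(model_mb: float, n_nodes: int, n_rounds: int,
                         keep_aggregated: bool = True) -> float:
    """Rough size (in MB) of the model weights retained after n_rounds.

    Assumption: every round keeps one copy of the weights per node,
    plus one aggregated copy when keep_aggregated is True.
    """
    copies_per_round = n_nodes + (1 if keep_aggregated else 0)
    return model_mb * copies_per_round * n_rounds


# Setup from this report: 600MB model, 3 nodes, 22 rounds.
retained = estimate_retained_mb(600, 3, 22)

# Retained memory is linear in model size, node count and round count,
# so a 10x smaller model reaches the same footprint after 10x more rounds.
same_footprint = estimate_retained_mb(60, 3, 220)
```

Under these assumptions the retained weights alone reach tens of gigabytes by round 22, and adding nodes shortens the number of rounds that fit in memory proportionally.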
Redesigning the training replies object to keep only the last aggregated model weights in memory, and to load other model weights from the file system when needed, would solve a large part of the memory issue.
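To make the proposal concrete, here is a minimal sketch of such a container. It is not the existing Fed-BioMed implementation: the class `DiskBackedReplies`, its methods, and the pickle-based file layout are all hypothetical and only illustrate keeping the last round in memory and earlier rounds on disk.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Optional


class DiskBackedReplies:
    """Hypothetical container: last round in memory, earlier rounds on disk."""

    def __init__(self, directory: str) -> None:
        self._dir = Path(directory)
        self._dir.mkdir(parents=True, exist_ok=True)
        self._last_round: Optional[int] = None
        self._last_reply: Any = None

    def _path(self, round_idx: int) -> Path:
        return self._dir / f"round_{round_idx}.pkl"

    def append(self, round_idx: int, reply: Any) -> None:
        # Spill the round currently held in memory to disk before replacing it.
        if self._last_round is not None:
            with open(self._path(self._last_round), "wb") as f:
                pickle.dump(self._last_reply, f)
        self._last_round = round_idx
        self._last_reply = reply

    def get(self, round_idx: int) -> Any:
        # The last round is served from memory, older rounds are read from disk.
        if round_idx == self._last_round:
            return self._last_reply
        with open(self._path(round_idx), "rb") as f:
            return pickle.load(f)


# Usage with small placeholder replies standing in for real model weights.
with tempfile.TemporaryDirectory() as tmp:
    history = DiskBackedReplies(tmp)
    for r in range(3):
        history.append(r, {"round": r, "weights": [float(r)] * 4})
    first = history.get(0)    # loaded from disk
    latest = history.get(2)   # served from memory
```

With this layout the memory footprint stays at one round of replies regardless of the number of rounds, at the cost of a disk read whenever an earlier round is requested.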
Hi @srcansiz
This is a known behaviour/limitation :-)
To avoid this issue you can use the following (currently in develop, not in master) to keep only the training_replies for the last round:
exp.set_retain_full_history(False)
We also noted in #207 a point to (possibly) re-implement training replies so as to keep only the last round in memory (and other rounds on disk, for minimal memory impact).