fedbiomed/fedbiomed

Researcher component needs better memory management

Closed this issue · 2 comments

While testing big neural network models over many training rounds I have encountered memory issues. Please see details below.

Model size: 600MB
Number of rounds of training: 100
Number of nodes: 3
Dry run: True
Operating System: Mac M3
Tested: using Pytest end-2-end machinery, Jupyter Notebook, and using plain python3.

After reaching round 22, the memory usage of the program (researcher) starts to go over 32 GB, which results in the following errors when using python or the pytest end-2-end facility:

envs/fedbiomed-researcher/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

On Jupyter Notebook, the kernel dies after reaching round 22.

This means that the researcher component can only handle about 200 rounds for an average 60MB model, or 2000 rounds for a 6MB model. The cause is training_replies, which keeps all the previous aggregated and individual model weights across the rounds of training. It also means that the researcher component can handle fewer rounds as the number of nodes increases, because more model weights are kept in the training replies object. Additionally, the number may vary depending on whether secure aggregation is activated, since it may increase the volume of the individual encrypted model weights.
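The scaling above can be sketched as a back-of-the-envelope calculation. This is only a rough estimate, assuming memory grows linearly with model size, number of nodes, and number of rounds, and taking the observed failure point (600MB model, 3 nodes, round 22) as the reference budget. The function name and parameters are hypothetical and not part of Fed-BioMed.

```python
def estimate_max_rounds(
    model_mb: int,
    n_nodes: int = 3,
    ref_model_mb: int = 600,
    ref_nodes: int = 3,
    ref_rounds: int = 22,
) -> int:
    """Rough number of rounds that fit in memory before the researcher fails.

    Assumption: every round keeps one copy of the model weights per node,
    so the memory budget is proportional to model size * nodes * rounds.
    The reference values are the observation reported in this issue.
    """
    budget = ref_model_mb * ref_nodes * ref_rounds  # in "MB-copies"
    return budget // (model_mb * n_nodes)


rounds_60mb = estimate_max_rounds(60)              # same 3 nodes, 60MB model
rounds_6mb = estimate_max_rounds(6)                # same 3 nodes, 6MB model
rounds_6_nodes = estimate_max_rounds(600, n_nodes=6)  # 600MB model, 6 nodes
print(rounds_60mb, rounds_6mb, rounds_6_nodes)
```

The first two values are of the same order as the 200 and 2000 rounds quoted above, and the last one shows how doubling the number of nodes halves the number of rounds under this assumption. Secure aggregation is not modelled here.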

Redesigning the training replies object to keep only the last aggregated model weights in memory, and to load other model weights from the file system when they are needed, could solve a large part of the memory issue.
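A minimal sketch of that idea, not the actual Fed-BioMed implementation: the class name `DiskBackedReplies`, its methods, and the use of `pickle` files are all assumptions made for illustration. Each round is written to disk, only the latest round stays in memory, and older rounds are reloaded on demand.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Optional


class DiskBackedReplies:
    """Hypothetical store: latest round in memory, all rounds on disk."""

    def __init__(self, directory: Path) -> None:
        self._dir = Path(directory)
        self._dir.mkdir(parents=True, exist_ok=True)
        self._last_round: Optional[int] = None
        self._last_replies: Any = None

    def _path(self, round_idx: int) -> Path:
        return self._dir / f"round_{round_idx}.pkl"

    def append(self, round_idx: int, replies: Any) -> None:
        # Persist the round so it can be reloaded later if needed.
        with open(self._path(round_idx), "wb") as f:
            pickle.dump(replies, f)
        # Keep only this round in memory; the previous one is dropped.
        self._last_round = round_idx
        self._last_replies = replies

    def get(self, round_idx: int) -> Any:
        # Serve the latest round from memory, older rounds from disk.
        if round_idx == self._last_round:
            return self._last_replies
        with open(self._path(round_idx), "rb") as f:
            return pickle.load(f)


# Usage with small stand-in "weights" instead of real model parameters.
with tempfile.TemporaryDirectory() as tmp:
    store = DiskBackedReplies(Path(tmp))
    for r in range(3):
        store.append(r, {"node_1": [float(r)] * 4})
    first = store.get(0)   # reloaded from disk
    latest = store.get(2)  # served from memory
```

With this layout, memory use stays at roughly one round of replies regardless of the number of rounds, at the cost of one disk read whenever an older round is accessed.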

Hi @srcansiz

This is a known behaviour/limitation :-)

To avoid this issue, you can use the following (currently in develop, not in master) to keep only the training_replies for the last round:

exp.set_retain_full_history(False)

We also noted in #207 a point to (possibly) re-implement training replies so as to keep only last round in memory (and other rounds on disk for minimal memory impact).

Hi @mvesin,

Thank you very much. I totally forgot that this method exists to avoid the issue. I am going to try running the tests with retain_full_history disabled.