A debugger for DeepSpeed engines. `ModelStepper` tracks model parameters, gradients, and loss with configurable tolerances.

`ModelStepper` accepts two training engines as returned from `deepspeed.initialize()` (one baseline and one test) and a `DataLoader` for training. `ModelStepper`'s `go()` method trains some number of batches and tracks the specified values (i.e., parameters, loss, and/or gradients). If a tracked component diverges from the baseline beyond the specified tolerance, `go()` returns `False` and reports information on the divergence.

Note: divergence of parameters and gradients is currently determined by the relative difference between the tensors, i.e., `((B - A).norm() / A.norm())`. The absolute difference is also reported when a divergence occurs.
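To make the criterion concrete, here is a minimal sketch of that check in plain Python. Flat lists stand in for tensors, and `rel_divergence` is a hypothetical helper for illustration, not part of the `ModelStepper` API:

```python
import math

def rel_divergence(base, test):
    """Mirror the check described above: relative difference is
    (B - A).norm() / A.norm(), with the absolute difference also reported.
    `base` and `test` are flat lists of floats standing in for tensors."""
    abs_diff = math.sqrt(sum((b - a) ** 2 for a, b in zip(base, test)))
    rel_diff = abs_diff / math.sqrt(sum(a ** 2 for a in base))
    return abs_diff, rel_diff

# A small perturbation on a vector of norm 5 gives rel_diff = abs_diff / 5.
abs_d, rel_d = rel_divergence([3.0, 4.0], [3.0, 4.1])
```

A tracked component is flagged as diverged when `rel_diff` exceeds the corresponding tolerance.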
Assumptions:
- The user must ensure that the baseline and test models are initialized with the same state.
- If parameters or gradients are tracked, the models are aligned such that `base_eng.module.parameters()` are comparable with `test_eng.module.parameters()`. In the near future, we should support doing an `all_gather()` to coordinate with varying model parallelism.
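Under that alignment assumption, the parameter check reduces to walking the two parameter lists in lockstep. A sketch, again with flat lists standing in for tensors (`first_diverged_param` is illustrative, not the actual implementation):

```python
import math

def first_diverged_param(base_params, test_params, tol=1e-5):
    """Return (param_idx, abs_diff, rel_diff) for the first aligned pair
    whose relative difference exceeds tol, or None if all are within it."""
    for idx, (a, b) in enumerate(zip(base_params, test_params)):
        abs_diff = math.sqrt(sum((y - x) ** 2 for x, y in zip(a, b)))
        rel_diff = abs_diff / math.sqrt(sum(x ** 2 for x in a))
        if rel_diff > tol:
            return idx, abs_diff, rel_diff
    return None

# First parameter pair is identical; the second differs.
hit = first_diverged_param([[1.0, 2.0], [3.0, 4.0]],
                           [[1.0, 2.0], [3.0, 5.0]])
```

The returned `param_idx` corresponds to the `param_idx` field in the `DIVERGED PARAMETER` report shown later in this document.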
ModelStepper has a small API:

```python
stepper = ModelStepper(base_engine,
                       test_engine,
                       trainloader,
                       num_batches=50,
                       test_every=1)
success = stepper.go()
```
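The overall shape of `go()` can be pictured as the loop below. This is a hypothetical outline, not the real implementation: the step functions are placeholders for the actual forward/backward/step calls, and only the loss check is shown.

```python
def go_sketch(base_step, test_step, batches, test_every=1, loss_tol=1e-5):
    """Hypothetical outline of go(): run both engines on each batch and,
    every `test_every` batches, compare losses by relative difference."""
    for i, batch in enumerate(batches):
        base_loss = base_step(batch)
        test_loss = test_step(batch)
        if i % test_every == 0:
            abs_diff = abs(base_loss - test_loss)
            rel_diff = abs_diff / abs(base_loss)
            if rel_diff > loss_tol:
                print(f"DIVERGED LOSS batch={i} abs_diff={abs_diff:.5e} "
                      f"rel_diff={rel_diff:.5e} tol={loss_tol:.5e}")
                return False
    return True
```

As with the real `go()`, the sketch returns `False` on the first divergence and `True` if all tracked values stay within tolerance.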
Check out `demo.py` and `ModelStepper.py` for more details.
Try the demo:
```
$ deepspeed demo.py --deepspeed --deepspeed_config=ds_config.json
<snip>
--- Model Stepper Configuration ---
batches=50
test_every=1
status_every=5
track_params=True
param_tol=1.000000e-05
track_loss=True
loss_tol=1.000000e-05
track_grads=True
grad_tol=1.000000e-05
STATUS batch=0 / 50 base_loss=2.30138 test_loss=2.30138 abs_diff=0.00000e+00 rel_diff=0.00000e+00
STATUS batch=5 / 50 base_loss=2.29744 test_loss=2.29744 abs_diff=0.00000e+00 rel_diff=0.00000e+00
STATUS batch=10 / 50 base_loss=2.25951 test_loss=2.25951 abs_diff=0.00000e+00 rel_diff=0.00000e+00
STATUS batch=15 / 50 base_loss=2.19609 test_loss=2.19609 abs_diff=0.00000e+00 rel_diff=0.00000e+00
STATUS batch=20 / 50 base_loss=2.12497 test_loss=2.12497 abs_diff=0.00000e+00 rel_diff=0.00000e+00
STATUS batch=25 / 50 base_loss=2.05403 test_loss=2.05403 abs_diff=2.38419e-07 rel_diff=1.16074e-07
STATUS batch=30 / 50 base_loss=1.99819 test_loss=1.99819 abs_diff=1.19209e-07 rel_diff=5.96587e-08
STATUS batch=35 / 50 base_loss=1.97918 test_loss=1.97918 abs_diff=0.00000e+00 rel_diff=0.00000e+00
STATUS batch=40 / 50 base_loss=1.98365 test_loss=1.98365 abs_diff=0.00000e+00 rel_diff=0.00000e+00
STATUS batch=45 / 50 base_loss=1.85610 test_loss=1.85610 abs_diff=0.00000e+00 rel_diff=0.00000e+00
STATUS batch=49 / 50 base_loss=1.89003 test_loss=1.89003 abs_diff=0.00000e+00 rel_diff=0.00000e+00
TEST PASSED
```
In contrast, here is the result of running with the `--fail` flag to demo a test failure. This mode sets `lr=0` in the tested model:
```
$ deepspeed demo.py --deepspeed --deepspeed_config=ds_config.json --fail
<snip>
DIVERGED PARAMETER rank=2 batch=0 param_idx=0 abs_diff=2.11209e-02 rel_diff=1.47273e-02 tol=1.00000e-04
DIVERGED PARAMETER rank=0 batch=0 param_idx=0 abs_diff=2.11209e-02 rel_diff=1.47273e-02 tol=1.00000e-04
DIVERGED PARAMETER rank=3 batch=0 param_idx=0 abs_diff=2.11209e-02 rel_diff=1.47273e-02 tol=1.00000e-04
DIVERGED PARAMETER rank=1 batch=0 param_idx=0 abs_diff=2.11209e-02 rel_diff=1.47273e-02 tol=1.00000e-04
STATUS batch=0 / 50 base_loss=2.30138 test_loss=2.30138 abs_diff=0.00000e+00 rel_diff=0.00000e+00
TEST FAILED
```
ModelStepper immediately detects that the model parameters have diverged from the baseline.