FlagOpen/FlagPerf

Longformer multi-card training: the evaluation process does not average results across cards

Closed this issue · 11 comments

During multi-card training, the model's accuracy is evaluated periodically, but the longformer model does not average the evaluation results across cards. Although each card runs inference over the full dataset, the inference results are not guaranteed to be identical on every card (random factors such as dropout remain). As a result, the evaluation results differ between cards, which can desynchronize the training state: the cards that reach the target accuracy stop training while the cards that do not keep training, creating the illusion that the machine is hung. The evaluation results should be all_gathered from all cards and then averaged.
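
A minimal sketch of the proposed fix, assuming `torch.distributed` has already been initialized by the launcher; the helper name, the `device` argument, and the stop-condition usage are illustrative and not taken from the FlagPerf code:

```python
import torch
import torch.distributed as dist

def average_metric_across_ranks(local_value: float, device) -> float:
    """Average a scalar evaluation metric over all ranks so that every rank
    sees the same number and makes the same stop/continue decision."""
    if not (dist.is_available() and dist.is_initialized()):
        return local_value
    metric = torch.tensor(local_value, dtype=torch.float64, device=device)
    dist.all_reduce(metric, op=dist.ReduceOp.SUM)  # sum the per-rank metrics
    metric /= dist.get_world_size()                # divide by the number of ranks
    return metric.item()

# Every rank then compares the same averaged accuracy against the target,
# so all ranks take the same stop/continue branch.
```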

Thanks for the report @happyxuwork. Please provide a screenshot of the running process or a code snippet. Besides, please @reiase confirm whether the status is True. If True, here are some options:

  1. @happyxuwork raises a PR to fix this; I'll review and merge it.
  2. @reiase raises a PR to fix this.
  3. @happyxuwork describes the bug in detail, and I'll work on a fix later.

Some log evidence:
rank0 log:
[screenshot: rank0 evaluation log]

rank4 log:
[screenshot: rank4 evaluation log]

As shown above, different cards may report slightly different accuracies. If the target accuracy is 0.64, rank4 will stop while the other ranks continue, creating the illusion that the machine is hung. Averaging across all cards keeps the accuracy, and therefore the training state, identical on every card at all times.
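
To make the failure mode concrete, here is an illustrative sketch with made-up per-rank accuracies (not the values from the logs above):

```python
# Made-up per-rank accuracies, purely for illustration
per_rank_acc = {"rank0": 0.639, "rank4": 0.641}
target_acc = 0.64

# Without averaging, each rank decides on its own local value:
# rank4 reaches the target and stops, rank0 does not and keeps training,
# so the job looks hung.
local_decisions = {rank: acc >= target_acc for rank, acc in per_rank_acc.items()}

# With averaging, every rank compares the same number and takes the same branch.
avg_acc = sum(per_rank_acc.values()) / len(per_rank_acc)
shared_decision = avg_acc >= target_acc  # identical on all ranks
```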

The information is acknowledged, @happyxuwork. Are you willing to submit a PR for this? If not, I'll work on it later.

@shh2000 Following FlagPerf's other models, a unified treatment by you would be more appropriate.

OK, I'll work on it later.

@shh2000 You can fix this by replacing https://github.com/FlagOpen/FlagPerf/blob/03d762cd472591783520c73514cff2551c92a830/training/benchmarks/longformer/pytorch/train/evaluator.py#L42C9-L42C42 with the following:

    acc = total_output / num_examples
    acc = reduce_tensor(acc)

with reduce_tensor defined as:

    import torch.distributed as dist

    def reduce_tensor(tensor):
        # Average the evaluation metric across all ranks so every card sees the
        # same value; fall back to the local value when not running distributed.
        rt = tensor.clone()
        if dist.is_available() and dist.is_initialized():
            dist.all_reduce(rt, op=dist.ReduceOp.SUM)
            rt /= dist.get_world_size()
            return rt
        return tensor
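
One caveat: `dist.all_reduce` operates on tensors, so if `total_output / num_examples` comes out as a plain Python float in the evaluator, it needs to be wrapped first. A possible variant, where the float-to-tensor conversion is an assumption about the evaluator's types rather than something verified against the FlagPerf code:

```python
import torch
import torch.distributed as dist

def reduce_scalar(value, device="cuda"):
    """Return the cross-rank average of a scalar metric; accepts either a
    tensor or a plain Python number, and falls back to the local value when
    training is not distributed."""
    t = value.detach().clone().to(device) if torch.is_tensor(value) \
        else torch.tensor(float(value), device=device)
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(t, op=dist.ReduceOp.SUM)
        t /= dist.get_world_size()
    return t
```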

Added a new PR to fix this: #499

Thanks for your PR #499

#500 is the same as #499. I've merged it into a temporary branch, and the metax PR #502 has also been merged.

#499 will remain open for the longformer developer. He'll check the status. If True, the temporary branch will be merged into main.

Longformer-nvidia's developer has already confirmed the status. This issue will be transferred to #532, and that PR will be merged soon.