longformer: multi-card training evaluation does not average results across cards
During multi-card training, the model accuracy must be evaluated periodically, but the longformer model does not average the evaluation results across cards. Although every card runs inference over the full evaluation data, the results are not guaranteed to be identical on all cards (because of random factors such as dropout). The per-card evaluation results therefore differ, which affects the training state: some cards reach the target accuracy and stop training, while others do not and keep training, creating the illusion that the machine is hung. The evaluation results of all cards should be all_gathered and then averaged.
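For illustration only, a minimal sketch (names are made up, not FlagPerf code) of how the per-rank stop decision diverges without averaging and stays consistent with it:

import torch
import torch.distributed as dist

def should_stop(local_acc: torch.Tensor, target_acc: float) -> bool:
    # Unsynchronized: each rank compares its own, slightly different accuracy,
    # e.g. one rank sees 0.641 and stops while another sees 0.639 and keeps training.
    return local_acc.item() >= target_acc

def should_stop_synced(local_acc: torch.Tensor, target_acc: float) -> bool:
    # Synchronized: average the accuracy over all ranks first, so every rank
    # compares the same value against the target and they all stop together.
    acc = local_acc.clone()
    dist.all_reduce(acc, op=dist.ReduceOp.SUM)
    acc /= dist.get_world_size()
    return acc.item() >= target_acc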
Thanks for the report @happyxuwork. Please provide a screenshot of the running process or a code snippet. Also, @reiase please confirm whether this is indeed the case. If it is, here are some possible solutions:
- @happyxuwork raises a PR to fix this, and I'll review and merge it.
- @reiase raises a PR to fix this.
- @happyxuwork describes the bug in detail, and I'll work on a fix later.
As shown above, different cards may differ slightly. If the target accuracy is 0.64, rank 4 will stop while the other ranks continue, creating the illusion that the machine is hung. Averaging across all cards keeps the accuracy, and hence the training state, identical on every card.
Acknowledged, @happyxuwork. Are you willing to submit a PR for this? If not, I'll work on it later.
@shh2000 Following FlagPerf's other models, it would be more appropriate for you to handle this in a unified way.
OK, I'll work on it later.
@shh2000 You can fix this by replacing https://github.com/FlagOpen/FlagPerf/blob/03d762cd472591783520c73514cff2551c92a830/training/benchmarks/longformer/pytorch/train/evaluator.py#L42C9-L42C42 as follows:

import torch.distributed as dist

acc = total_output / num_examples
acc = reduce_tensor(acc)

def reduce_tensor(tensor):
    # Average a per-rank metric across all ranks so every rank sees the same value.
    rt = tensor.clone()
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(rt, op=dist.ReduceOp.SUM)
        rt /= dist.get_world_size()
        return rt
    return tensor
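One caveat (an assumption on my side, not checked against the current evaluator): dist.all_reduce operates on tensors, so if total_output / num_examples ends up as a plain Python float it would need to be wrapped first, for example:

import torch
# hypothetical wrapping; the device should match the one used by the evaluator
acc = torch.tensor(total_output / num_examples, device="cuda")
acc = reduce_tensor(acc)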
Added a new PR to fix this: #499