Question about missing values in `batch_mimicit`, possibly caused by `delete_tensors_from_dict` and `pop`
Li-Qingyun opened this issue
Hi, it seems that the `pop` and `delete_tensors_from_dict` operations cause some samples to go missing. What are these operations for? The operations in question are:
```python
delete_tensors_from_dict(batch_mimicit)
delete_tensors_from_dict(
    {
        "other": [
            images,
            input_ids,
            attention_mask,
            labels,
        ]
    }
)
```
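I could not confirm the exact implementation of `delete_tensors_from_dict`, but I assume it behaves roughly like the following hypothetical sketch (the `_sketch` name and all details are my assumption, not the repo's code). The important property, if it works like this, is that it mutates the dict *in place*, so the change is visible to anything else that still holds a reference to the same batch dict:

```python
import torch

def delete_tensors_from_dict_sketch(d):
    # Hypothetical sketch: recursively drop tensor references from a
    # (nested) dict, presumably to free GPU/CPU memory before the next step.
    if isinstance(d, dict):
        for key in list(d.keys()):
            value = d[key]
            if isinstance(value, torch.Tensor):
                del d[key]  # mutates the caller's dict, not a copy
            else:
                delete_tensors_from_dict_sketch(value)
    elif isinstance(d, (list, tuple)):
        for item in d:
            delete_tensors_from_dict_sketch(item)
```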
I ran into an empty `net_input` at the tail of an epoch of training, even though the logic in `collate_fn` suggests every batch should at least contain `input_ids` and `attention_masks`. I have not found why `net_input` is empty, but when I remove `delete_tensors_from_dict` in my debugging script (which skips the forward/backward pass and keeps only the data loading, to make the error surface faster), the bug disappears. It is strange.
Additionally, the error that actually surfaces is an NCCL timeout: `...[Rank x] Watchdog caught collective operation timeout: WorkNCCL...`. I found the true error by adding this:
```python
def train_one_epoch(args, model, epoch, mimicit_loaders, tokenizer, optimizer, lr_scheduler, device_id, accelerator, wandb):
    dataloader_iterators = [cycle(dataloader) for dataloader in mimicit_loaders]
    ......
    for num_steps in tqdm(range(args.total_training_steps), disable=args.rank != 0, initial=(epoch * num_batches_per_epoch)):
        ......
        #### MIMIC-IT FORWARD PASS ####
        try:
            # Read the fields without pop, so the batch dict stays intact.
            net_input = batch_mimicit["net_input"]
            images = net_input["patch_images"].to(device_id, non_blocking=True)
            input_ids = net_input["input_ids"].to(device_id, non_blocking=True)
            attention_mask = net_input["attention_masks"].to(device_id, non_blocking=True)
            labels = None  # placeholder to avoid error
            # Original version, which mutates the batch dict in place:
            # net_input = batch_mimicit.pop("net_input")
            # images = net_input.pop("patch_images").to(device_id, non_blocking=True)
            # input_ids = net_input.pop("input_ids").to(device_id, non_blocking=True)
            # attention_mask = net_input.pop("attention_masks").to(device_id, non_blocking=True)
            # labels = None  # placeholder to avoid error
        except Exception as err:
            print(f"\n\n\n{err}\n")
            print(batch_mimicit)
            exit()
        ......
        try:
            loss_mimicit = forward_pass(
                args,
                model,
                tokenizer,
                images,
                input_ids,
                attention_mask,
                labels,
                device_id,
                autocast_type,
                batch_mimicit,
            )
        except Exception as err:
            print(f"\nforward_pass\n")
            print(f"\n\n\n{err}\n")
            print(batch_mimicit)
            exit()
        ......
        # Original cleanup, now disabled:
        # delete_tensors_from_dict(batch_mimicit)
        # delete_tensors_from_dict(
        #     {
        #         "other": [
        #             images,
        #             input_ids,
        #             attention_mask,
        #             labels,
        #         ]
        #     }
        # )
        ......
```
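For what it's worth, the NCCL timeout itself makes sense once one rank gets an empty batch: that rank raises (or exits) before reaching its collective calls, while the other ranks enter the collective and block until the watchdog fires. Below is a minimal sketch of the same failure mode, using the gloo backend with a short timeout so it errors quickly instead of hanging; the script, port, and `worker` function are mine for illustration, not from the repo:

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from datetime import timedelta

def worker(rank, world_size):
    # Short timeout so the mismatch surfaces quickly in this toy example.
    dist.init_process_group(
        "gloo",
        init_method="tcp://127.0.0.1:29500",
        rank=rank,
        world_size=world_size,
        timeout=timedelta(seconds=5),
    )
    t = torch.ones(1)
    if rank == 0:
        # Rank 0 enters the collective; rank 1 "had an empty batch" and
        # never joins, so this call times out (or errors once the peer exits)
        # instead of completing.
        dist.all_reduce(t)

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```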
Maybe `cycle(dataloader)` makes some samples appear again, but their values were already deleted by the function?
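If that is what happens, it is because `itertools.cycle` caches every item its input iterator yields and then replays the *same objects* on later passes, so an in-place edit to a cached batch dict is visible the next time it comes around. A standalone sketch, with plain dicts standing in for real batches:

```python
from itertools import cycle

batches = [{"net_input": {"input_ids": [1, 2, 3]}} for _ in range(2)]
it = cycle(batches)

# First pass: consume each batch, then "clean it up" in place, as the
# training loop does with pop / delete_tensors_from_dict.
for _ in range(2):
    batch = next(it)
    batch.pop("net_input")

# Second pass: cycle replays the same dict objects, now emptied.
for _ in range(2):
    print(next(it))  # {} -- net_input is gone, matching the empty batch I saw
```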