modelscope/data-juicer

potential bug of checkpointing

drcege opened this issue · 3 comments

As the title suggests, the issue lies in the following code.

recorded_op_num = len(self.op_record)
prefix_process = self.process_list[:recorded_op_num]
all_the_same = True
dif1, dif2 = None, None
for record_op, config_op in zip(self.op_record, prefix_process):
    if record_op != config_op:
        all_the_same = False
        dif1, dif2 = record_op, config_op
        break
if all_the_same:
    for op in self.op_record:
        op_name = list(op.keys())[0]
        logger.info(f'Skip op [{op_name}].')
    self.process_list = self.process_list[recorded_op_num:]
    return True

When the new process_list is shorter than op_record, the slice self.process_list[:recorded_op_num] does not raise an error for the out-of-range end index. Python silently truncates the slice to the available length, so len(prefix_process) < len(self.op_record). Similarly, zip stops at the end of the shorter iterable, so the loop compares only the first len(prefix_process) entries. If those entries happen to match, all_the_same stays True and check_ops_to_skip incorrectly concludes that the recorded operators match a prefix of the current operator list, even though some recorded operators are no longer in the config at all.
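To make the failure mode concrete, here is a minimal standalone sketch rather than the actual data-juicer code. The function names `ops_match_prefix_buggy` and `ops_match_prefix_checked` and the operator names `op_a`, `op_b`, `op_c` are hypothetical and only for illustration. The second function shows one possible guard, an explicit length check before the comparison; the maintainers may prefer a different fix.

```python
def ops_match_prefix_buggy(op_record, process_list):
    # Mirrors the comparison in the snippet above: slice, then zip.
    # Both the slice and zip silently truncate to the shorter length.
    prefix = process_list[:len(op_record)]
    return all(r == c for r, c in zip(op_record, prefix))


def ops_match_prefix_checked(op_record, process_list):
    # Hypothetical guard: a record longer than the current config
    # can never be a prefix of it, so reject it up front.
    if len(op_record) > len(process_list):
        return False
    prefix = process_list[:len(op_record)]
    return all(r == c for r, c in zip(op_record, prefix))


# Checkpoint recorded three ops, but the new config lists only two.
op_record = [{'op_a': {}}, {'op_b': {}}, {'op_c': {}}]
process_list = [{'op_a': {}}, {'op_b': {}}]

print(ops_match_prefix_buggy(op_record, process_list))    # True (wrong)
print(ops_match_prefix_checked(op_record, process_list))  # False
```

With the buggy version, the checkpoint would be treated as reusable even though it was produced by an operator (`op_c` here) that the new config no longer contains.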

Is that the case? @HYLcool @yxdyc

Yes, that is a problem in this situation. 👍🏻

This issue is marked as stale because there has been no activity for 21 days. Remove the stale label or add a new comment, or this issue will be closed in 3 days.

Close this stale issue.