RecursionError: maximum recursion depth exceeded in comparison
I use ORT (ONNX Runtime) like this:
...
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = ORTModule(model)
model = nn.parallel.DistributedDataParallel(model, find_unused_parameters=True, device_ids=[device])
...
But I get this error:
Traceback (most recent call last):
File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/home/users/min.du/hdlt/feature_j5fsd_configs/HDLT/hdlt/engine/ddp_trainer.py", line 156, in _main_func
main_func(local_rank, *args)
File "/home/users/min.du/hdlt/feature_j5fsd_configs/HDLT/tools/train.py", line 163, in train_entrance
trainer.fit()
File "/home/users/min.du/hdlt/feature_j5fsd_configs/HDLT/tools/trainer_wrapper.py", line 225, in fit
self._trainer.fit()
File "/home/users/min.du/hdlt/feature_j5fsd_configs/HDLT/hdlt/engine/trainer.py", line 298, in fit
profiler=self.profiler,
File "/home/users/min.du/hdlt/feature_j5fsd_configs/HDLT/hdlt/engine/processors/processor.py", line 265, in __call__
model_outs = model(*_as_list(batch_i))
File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/onnxruntime/training/ortmodule/ortmodule.py", line 41, in _forward
return self._execution_manager(self._is_training()).forward(*inputs, **kwargs)
File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/onnxruntime/training/ortmodule/_training_manager.py", line 67, in forward
build_gradient_graph = self._export_model(*inputs, **kwargs)
File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 206, in _export_model
schema = _io._extract_schema({'args': copy.copy(inputs), 'kwargs': copy.copy(kwargs)})
File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/onnxruntime/training/ortmodule/_io.py", line 300, in _extract_schema
data[key] = _extract_schema(data[key])
File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/onnxruntime/training/ortmodule/_io.py", line 291, in _extract_schema
data[idx] = _extract_schema(data[idx])
File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/onnxruntime/training/ortmodule/_io.py", line 291, in _extract_schema
data[idx] = _extract_schema(data[idx])
File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/onnxruntime/training/ortmodule/_io.py", line 291, in _extract_schema
data[idx] = _extract_schema(data[idx])
[Previous line repeated 949 more times]
File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/onnxruntime/training/ortmodule/_io.py", line 287, in _extract_schema
if isinstance(data, abc.Sequence):
File "/home/users/min.du/venvs/pytorch1.8/lib64/python3.6/abc.py", line 184, in __instancecheck__
if subclass in cls._abc_cache:
File "/home/users/min.du/venvs/pytorch1.8/lib64/python3.6/_weakrefset.py", line 75, in __contains__
return wr in self.data
RecursionError: maximum recursion depth exceeded in comparison
Any suggestions?
Is there any chance your input has strings in it? If so, such a string would have 949 chars.
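One quick way to check the string hypothesis is to walk the batch before it reaches the model and list every place a `str` appears. The helper below is only a sketch: `find_string_paths` and the sample `batch` are hypothetical names made up for illustration, not part of ORTModule or PyTorch, and it assumes the batch is built from plain lists, tuples, and dicts.

```python
from collections import abc


def find_string_paths(data, path="batch"):
    """Return the path of every str leaf inside nested lists/tuples/dicts.

    Hypothetical diagnostic helper, not part of any library.
    """
    # Check str first: a str is itself an abc.Sequence, so it must be
    # treated as a leaf before the generic Sequence branch below.
    if isinstance(data, str):
        return [path]
    found = []
    if isinstance(data, abc.Mapping):
        for key, value in data.items():
            found.extend(find_string_paths(value, f"{path}[{key!r}]"))
    elif isinstance(data, abc.Sequence):
        for idx, item in enumerate(data):
            found.extend(find_string_paths(item, f"{path}[{idx}]"))
    return found


# Made-up batch: one dict with a file name mixed in with numeric data.
batch = [{"image": [0.1, 0.2], "name": "img_001.jpg"}, 3]
print(find_string_paths(batch))  # ["batch[0]['name']"]
```

If this returns a non-empty list for the arguments passed as `model(*_as_list(batch_i))`, the inputs do contain strings.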
Can you retest? microsoft/onnxruntime#8098 may have fixed your issue (merged the same day you created the issue)
@DuinoDu - after the retest, if the issue persists, can you please provide reproducible steps or scripts, possibly along with the model? Thanks.
Hi @DuinoDu, without waiting, I cloned the PyTorch repo and used the unit tests in this file to try to reproduce the issue:
~/pytorch/torch/testing/_internal/distributed/distributed_test.py
However, I can't reproduce your error with it. (I did hit another error, but it is unrelated, and its fix will be in a future release.) It would therefore help if you could provide more details, ideally a small reproducible case, if you still see the issue with the fix (microsoft/onnxruntime#8098).
Thanks.
@DuinoDu FYI - by reverting the fix, I can repro the same exception.
FAILED orttraining_test_ortmodule_api.py::test_input_with_string_exception - RecursionError: maximum recursion depth exceeded in comparison
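For anyone wondering why a string in the input can produce this traceback: in Python a `str` is an `abc.Sequence`, and each element of a string is again a one-character `str`, so a walker that recurses into every `Sequence` never reaches a leaf. The sketch below is a simplified stand-in, not the actual `_extract_schema` code from `_io.py`; both function names are hypothetical, and the "safe" variant only illustrates the general idea of treating strings as leaves, not necessarily what microsoft/onnxruntime#8098 does.

```python
from collections import abc


def extract_schema_naive(data):
    """Hypothetical walker that recurses into any Sequence or Mapping."""
    if isinstance(data, abc.Sequence):
        # For a non-empty str, each item is another one-character str,
        # so this branch recurses forever.
        return [extract_schema_naive(item) for item in data]
    if isinstance(data, abc.Mapping):
        return {key: extract_schema_naive(value) for key, value in data.items()}
    return type(data).__name__


def extract_schema_safe(data):
    """Same walker, but a str is treated as a leaf before the Sequence check."""
    if isinstance(data, str):
        return "str"
    if isinstance(data, abc.Sequence):
        return [extract_schema_safe(item) for item in data]
    if isinstance(data, abc.Mapping):
        return {key: extract_schema_safe(value) for key, value in data.items()}
    return type(data).__name__


try:
    extract_schema_naive([1.0, "x"])
except RecursionError:
    print("naive walker hit RecursionError")

print(extract_schema_safe([1.0, "x"]))  # ['float', 'str']
```

Note that in this sketch the recursion does not depend on the length of the string: any non-empty string is enough to trigger it.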