pytorch/ort

RecursionError: maximum recursion depth exceeded in comparison


I use ORTModule like this:

...
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
model = ORTModule(model)
model = nn.parallel.DistributedDataParallel(model, find_unused_parameters=True, device_ids=[device])
...

But I get this error:

Traceback (most recent call last):
  File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/users/min.du/hdlt/feature_j5fsd_configs/HDLT/hdlt/engine/ddp_trainer.py", line 156, in _main_func
    main_func(local_rank, *args)
  File "/home/users/min.du/hdlt/feature_j5fsd_configs/HDLT/tools/train.py", line 163, in train_entrance
    trainer.fit()
  File "/home/users/min.du/hdlt/feature_j5fsd_configs/HDLT/tools/trainer_wrapper.py", line 225, in fit
    self._trainer.fit()
  File "/home/users/min.du/hdlt/feature_j5fsd_configs/HDLT/hdlt/engine/trainer.py", line 298, in fit
    profiler=self.profiler,
  File "/home/users/min.du/hdlt/feature_j5fsd_configs/HDLT/hdlt/engine/processors/processor.py", line 265, in __call__
    model_outs = model(*_as_list(batch_i))
  File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/onnxruntime/training/ortmodule/ortmodule.py", line 41, in _forward
    return self._execution_manager(self._is_training()).forward(*inputs, **kwargs)
  File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/onnxruntime/training/ortmodule/_training_manager.py", line 67, in forward
    build_gradient_graph = self._export_model(*inputs, **kwargs)
  File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/onnxruntime/training/ortmodule/_graph_execution_manager.py", line 206, in _export_model
    schema = _io._extract_schema({'args': copy.copy(inputs), 'kwargs': copy.copy(kwargs)})
  File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/onnxruntime/training/ortmodule/_io.py", line 300, in _extract_schema
    data[key] = _extract_schema(data[key])
  File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/onnxruntime/training/ortmodule/_io.py", line 291, in _extract_schema
    data[idx] = _extract_schema(data[idx])
  File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/onnxruntime/training/ortmodule/_io.py", line 291, in _extract_schema
    data[idx] = _extract_schema(data[idx])
  File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/onnxruntime/training/ortmodule/_io.py", line 291, in _extract_schema
    data[idx] = _extract_schema(data[idx])
  [Previous line repeated 949 more times]
  File "/home/users/min.du/venvs/pytorch1.8/lib/python3.6/site-packages/onnxruntime/training/ortmodule/_io.py", line 287, in _extract_schema
    if isinstance(data, abc.Sequence):
  File "/home/users/min.du/venvs/pytorch1.8/lib64/python3.6/abc.py", line 184, in __instancecheck__
    if subclass in cls._abc_cache:
  File "/home/users/min.du/venvs/pytorch1.8/lib64/python3.6/_weakrefset.py", line 75, in __contains__
    return wr in self.data
RecursionError: maximum recursion depth exceeded in comparison

Any suggestions?

Any chance your input contains strings? If so, would such a string have 949 characters?

Can you retest? microsoft/onnxruntime#8098 may have fixed your issue (it was merged the same day you created this issue).

@DuinoDu - if the issue persists after the retest, could you please provide reproducible steps or scripts, and if possible the model as well? Thanks.

Hi @DuinoDu, in the meantime I cloned the PyTorch repo and used the unit tests in this file to try to reproduce the issue:
~/pytorch/torch/testing/_internal/distributed/distributed_test.py

However, I can't reproduce your error with it. (I did hit another error, but it is unrelated, and its fix will be in a future release.) It would therefore help if you could provide more details, ideally a small reproducible case, if you still see the issue with the fix (microsoft/onnxruntime#8098) applied.
Thanks.

@DuinoDu FYI - by reverting the fix, I can repro the same exception.

FAILED orttraining_test_ortmodule_api.py::test_input_with_string_exception - RecursionError: maximum recursion depth exceeded in comparison
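
The failing test name points at string inputs. A minimal sketch of why a schema walk of this shape may never terminate on a string: in Python a str is an abc.Sequence, and each element of a str is itself a one-character str. The functions below (naive_extract_schema, guarded_extract_schema, hits_recursion_limit) are hypothetical stand-ins written for illustration, not the actual onnxruntime code or the actual fix in microsoft/onnxruntime#8098. If this is the cause, the repeat count in the traceback would reflect the interpreter's recursion limit (1000 by default) rather than the length of any string.

```python
from collections import abc


def naive_extract_schema(data):
    """Hypothetical stand-in: recurse into every Sequence element."""
    if isinstance(data, abc.Sequence):
        # A str is a Sequence, and each element of a str is again a
        # one-character str, so this branch never bottoms out.
        return [naive_extract_schema(item) for item in data]
    return type(data).__name__


def guarded_extract_schema(data):
    """Same walk, but str and bytes are treated as leaves."""
    if isinstance(data, (str, bytes)):
        return type(data).__name__
    if isinstance(data, abc.Sequence):
        return [guarded_extract_schema(item) for item in data]
    return type(data).__name__


def hits_recursion_limit(fn, data):
    """Return True if calling fn(data) raises RecursionError."""
    try:
        fn(data)
    except RecursionError:
        return True
    return False


if __name__ == "__main__":
    batch = [1.0, "label"]  # assumed example input containing a string
    print(hits_recursion_limit(naive_extract_schema, batch))
    print(hits_recursion_limit(guarded_extract_schema, batch))
```

On a version without the fix, a possible workaround along these lines would be to keep plain strings out of the arguments passed to the wrapped model's forward call.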

natke commented

Hi @DuinoDu, we are closing this issue now, as we believe we have resolved it. Please re-open or create a new issue if you need more assistance. Thank you!

@DuinoDu - please feel free to re-open it if you still have the same issue.
Thanks.