pytorch/kineto

tb_plugin is failing test_compare_with_autograd.py

aaronenyeshi opened this issue · 2 comments

The tb_plugin CI is failing in tb_plugin/test/test_compare_with_autograd.py on the latest trunk. This may be caused by new nightly torch or torchvision packages.

Error Snippet:

============================= test session starts ==============================
platform linux -- Python 3.8.16, pytest-7.3.2, pluggy-1.0.0
rootdir: /home/runner/work/kineto/kineto/tb_plugin
collected 36 items / 1 error

==================================== ERRORS ====================================
_____________ ERROR collecting test/test_compare_with_autograd.py ______________
test_compare_with_autograd.py:10: in <module>
    import torchvision
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/torchvision/__init__.py:6: in <module>
    from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/torchvision/_meta_registrations.py:25: in <module>
    def meta_roi_align(input, rois, spatial_scale, pooled_height, pooled_width, sampling_ratio, aligned):
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/torchvision/_meta_registrations.py:18: in wrapper
    get_meta_lib().impl(getattr(getattr(torch.ops.torchvision, op_name), overload_name), fn)
/opt/hostedtoolcache/Python/3.8.16/x64/lib/python3.8/site-packages/torch/library.py:129: in impl
    raise RuntimeError(
E   RuntimeError: We should not register a meta kernel directly to the operator 'torchvision::roi_align', because it has a CompositeImplicitAutograd kernel in core. Instead we should let the operator decompose, and ensure that we have meta kernels for the base ops that it decomposes into.
=========================== short test summary info ============================
ERROR test_compare_with_autograd.py - RuntimeError: We should not register a meta kernel directly to the operator 'torchvision::roi_align', because it has a CompositeImplicitAutograd kernel in core. Instead we should let the operator decompose, and ensure that we have meta kernels for the base ops that it decomposes into.
!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!
=============================== 1 error in 2.18s ===============================
Error: Process completed with exit code 2.

Logs can be found here:
https://github.com/pytorch/kineto/actions/runs/5263899358/jobs/9514536899
https://github.com/pytorch/kineto/actions/runs/5259252736/jobs/9508333740
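A quick way to confirm whether the installed nightly pair is the culprit (rather than anything in kineto) is to import torchvision on the runner and report both versions. The snippet below is a minimal, hypothetical sanity check, not part of the tb_plugin test suite; it only relies on the standard torch.__version__ / torchvision.__version__ attributes.

# Hypothetical pre-test sanity check (not part of the kineto repo): verify that the
# installed nightly torch and torchvision can be imported together before running
# the tb_plugin tests.
import sys

import torch

print(f"torch {torch.__version__}")

try:
    import torchvision
except RuntimeError as exc:
    # The roi_align meta-kernel registration error above surfaces here, at import
    # time, which points at an incompatible torch/torchvision pair rather than a
    # kineto/tb_plugin problem.
    print(f"torchvision failed to import: {exc}", file=sys.stderr)
    sys.exit(1)

print(f"torchvision {torchvision.__version__}")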

I'm closing this for now because I no longer see these failures (e.g. see #772, https://github.com/pytorch/kineto/actions/runs/5316502635/jobs/9626100736?pr=772). This was likely a transient compatibility issue between the pytorch and torchvision repos, unrelated to kineto.
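For reference, if we wanted the kineto CI to be robust to this kind of upstream mismatch, one option (a sketch only, not the current test code) would be to skip the torchvision-dependent tests whenever the torchvision import fails, whether with an ImportError or a RuntimeError like the one above:

# Hypothetical guard for test_compare_with_autograd.py (not the actual kineto test
# code): skip torchvision-dependent tests when torchvision is missing or incompatible
# with the installed torch, instead of failing collection.
import pytest

try:
    import torchvision  # noqa: F401
    HAS_TORCHVISION = True
except (ImportError, RuntimeError):
    # RuntimeError covers registration failures such as the roi_align error above;
    # ImportError covers a missing torchvision package.
    HAS_TORCHVISION = False

# Decorator for the tests that need torchvision.
skip_if_no_torchvision = pytest.mark.skipif(
    not HAS_TORCHVISION,
    reason="torchvision is unavailable or incompatible with the installed torch",
)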