CUDA sync issue after updating base image
KumoLiu opened this issue · 1 comment
KumoLiu commented
Starting test: test_value_0_fp32 (tests.test_convert_to_trt.TestConvertToTRT)...
WARNING:root:Given dtype that does not have direct mapping to torch (dtype.unknown), defaulting to torch.float
WARNING:root:Given dtype that does not have direct mapping to torch (dtype.unknown), defaulting to torch.float
WARNING: [Torch-TensorRT] - Detected and removing exception in TorchScript IR for node: = prim::If(%387) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:562:8 block0(): %388 : str = aten::format(%318, %386) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:563:29 = prim::RaiseException(%388, %317) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:563:12 -> () block1(): -> ()
WARNING: [Torch-TensorRT] - Detected and removing exception in TorchScript IR for node: = prim::If(%401) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:562:8 block0(): %402 : str = aten::format(%318, %400) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:563:29 = prim::RaiseException(%402, %317) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:563:12 -> () block1(): -> ()
WARNING: [Torch-TensorRT] - Detected and removing exception in TorchScript IR for node: = prim::If(%415) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:562:8 block0(): %416 : str = aten::format(%318, %414) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:563:29 = prim::RaiseException(%416, %317) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:563:12 -> () block1(): -> ()
WARNING: [Torch-TensorRT] - Detected and removing exception in TorchScript IR for node: = prim::If(%429) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:562:8 block0(): %430 : str = aten::format(%318, %428) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:563:29 = prim::RaiseException(%430, %317) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:563:12 -> () block1(): -> ()
WARNING: [Torch-TensorRT] - Conv3d layer with kernel size = 1 configuration incurs a failure with TensorRT tactic optimizer in some cases. Github issue: https://github.com/pytorch/TensorRT/issues/1445. Other conv variants do not have this issue.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - Environment variable NVIDIA_TF32_OVERRIDE=0 but BuilderFlag::kTF32 is set. Disabling TF32.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - Environment variable NVIDIA_TF32_OVERRIDE=0 but BuilderFlag::kTF32 is set. Disabling TF32.
.Finished test: test_value_0_fp32 (tests.test_convert_to_trt.TestConvertToTRT) (32.1s)
Starting test: test_value_1_fp16 (tests.test_convert_to_trt.TestConvertToTRT)...
WARNING:root:Given dtype that does not have direct mapping to torch (dtype.unknown), defaulting to torch.float
WARNING:root:Given dtype that does not have direct mapping to torch (dtype.unknown), defaulting to torch.float
WARNING: [Torch-TensorRT] - Detected and removing exception in TorchScript IR for node: = prim::If(%387) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:562:8 block0(): %388 : str = aten::format(%318, %386) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:563:29 = prim::RaiseException(%388, %317) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:563:12 -> () block1(): -> ()
WARNING: [Torch-TensorRT] - Detected and removing exception in TorchScript IR for node: = prim::If(%401) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:562:8 block0(): %402 : str = aten::format(%318, %400) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:563:29 = prim::RaiseException(%402, %317) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:563:12 -> () block1(): -> ()
WARNING: [Torch-TensorRT] - Detected and removing exception in TorchScript IR for node: = prim::If(%415) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:562:8 block0(): %416 : str = aten::format(%318, %414) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:563:29 = prim::RaiseException(%416, %317) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:563:12 -> () block1(): -> ()
WARNING: [Torch-TensorRT] - Detected and removing exception in TorchScript IR for node: = prim::If(%429) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:562:8 block0(): %430 : str = aten::format(%318, %428) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:563:29 = prim::RaiseException(%430, %317) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:563:12 -> () block1(): -> ()
WARNING: [Torch-TensorRT] - Conv3d layer with kernel size = 1 configuration incurs a failure with TensorRT tactic optimizer in some cases. Github issue: https://github.com/pytorch/TensorRT/issues/1445. Other conv variants do not have this issue.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - Environment variable NVIDIA_TF32_OVERRIDE=0 but BuilderFlag::kTF32 is set. Disabling TF32.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - Environment variable NVIDIA_TF32_OVERRIDE=0 but BuilderFlag::kTF32 is set. Disabling TF32.
======================================================================
FAIL: test_value_043_cuda (tests.test_hausdorff_distance.TestHausdorffDistance)
device: cuda metric: euclidean directed:False expected: 20.223748416156685
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/parameterized/parameterized.py", line 620, in standalone_func
    return func(*(a + p.args), **p.kwargs, **kw)
  File "/workspace/MONAI/tests/test_hausdorff_distance.py", line 194, in test_value
    np.testing.assert_allclose(expected_value, result.cpu(), rtol=1e-6)
  File "/usr/local/lib/python3.10/dist-packages/numpy/testing/_private/utils.py", line 1592, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/usr/local/lib/python3.10/dist-packages/numpy/testing/_private/utils.py", line 862, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-06, atol=0
Mismatched elements: 1 / 1 (100%)
Max absolute difference: 15.198812
Max relative difference: 3.0246766
x: array(20.223748)
y: array([5.024938], dtype=float32)
======================================================================
FAIL: test_value_078_cuda (tests.test_hausdorff_distance.TestHausdorffDistance)
device: cuda metric: euclidean directed:True expected: 19.924858845171276
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/parameterized/parameterized.py", line 620, in standalone_func
    return func(*(a + p.args), **p.kwargs, **kw)
  File "/workspace/MONAI/tests/test_hausdorff_distance.py", line 194, in test_value
    np.testing.assert_allclose(expected_value, result.cpu(), rtol=1e-6)
  File "/usr/local/lib/python3.10/dist-packages/numpy/testing/_private/utils.py", line 1592, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/usr/local/lib/python3.10/dist-packages/numpy/testing/_private/utils.py", line 862, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-06, atol=0
Mismatched elements: 1 / 1 (100%)
Max absolute difference: 19.924858
Max relative difference: inf
x: array(19.924859)
y: array([0.], dtype=float32)
KumoLiu commented
The issue started occurring after updating the base image to 24.08.
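Given the issue title, one plausible reading of the wrong Hausdorff values (e.g. `0.` or `5.024938` where ~20 is expected) is that the metric tensor is copied to the host before the CUDA kernels producing it have finished. A minimal sketch of a diagnostic workaround is below; this is an assumption about the root cause, not a confirmed fix, and `metric_to_host` is a hypothetical helper, not part of the MONAI API.

```python
import numpy as np
import torch


def metric_to_host(result: torch.Tensor) -> np.ndarray:
    """Copy a metric result to the host, syncing first if it lives on a GPU.

    torch.Tensor.cpu() should already synchronize on the copy stream, but an
    explicit torch.cuda.synchronize() rules out missed-sync bugs when testing
    against a new base image.
    """
    if result.is_cuda:
        # Block until all queued kernels on this device have completed.
        torch.cuda.synchronize(result.device)
    return result.detach().cpu().numpy()


# Usage mirroring the failing assertion in test_hausdorff_distance.py:
#   np.testing.assert_allclose(expected_value, metric_to_host(result), rtol=1e-6)
```

If the test passes with the explicit sync but fails without it, that would confirm a synchronization regression in the 24.08 image rather than a numerical change in the metric itself.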