CUDA sync issue after updating base image
KumoLiu opened this issue · 1 comment
KumoLiu commented
Starting test: test_value_0_fp32 (tests.test_convert_to_trt.TestConvertToTRT)...
WARNING:root:Given dtype that does not have direct mapping to torch (dtype.unknown), defaulting to torch.float
WARNING:root:Given dtype that does not have direct mapping to torch (dtype.unknown), defaulting to torch.float
WARNING: [Torch-TensorRT] - Detected and removing exception in TorchScript IR for node: = prim::If(%387) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:562:8 block0(): %388 : str = aten::format(%318, %386) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:563:29 = prim::RaiseException(%388, %317) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:563:12 -> () block1(): -> ()
WARNING: [Torch-TensorRT] - Detected and removing exception in TorchScript IR for node: = prim::If(%401) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:562:8 block0(): %402 : str = aten::format(%318, %400) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:563:29 = prim::RaiseException(%402, %317) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:563:12 -> () block1(): -> ()
WARNING: [Torch-TensorRT] - Detected and removing exception in TorchScript IR for node: = prim::If(%415) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:562:8 block0(): %416 : str = aten::format(%318, %414) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:563:29 = prim::RaiseException(%416, %317) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:563:12 -> () block1(): -> ()
WARNING: [Torch-TensorRT] - Detected and removing exception in TorchScript IR for node: = prim::If(%429) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:562:8 block0(): %430 : str = aten::format(%318, %428) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:563:29 = prim::RaiseException(%430, %317) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:563:12 -> () block1(): -> ()
WARNING: [Torch-TensorRT] - Conv3d layer with kernel size = 1 configuration incurs a failure with TensorRT tactic optimizer in some cases. Github issue: https://github.com/pytorch/TensorRT/issues/1445. Other conv variants do not have this issue.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - Environment variable NVIDIA_TF32_OVERRIDE=0 but BuilderFlag::kTF32 is set. Disabling TF32.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - Environment variable NVIDIA_TF32_OVERRIDE=0 but BuilderFlag::kTF32 is set. Disabling TF32.
.Finished test: test_value_0_fp32 (tests.test_convert_to_trt.TestConvertToTRT) (32.1s)
Starting test: test_value_1_fp16 (tests.test_convert_to_trt.TestConvertToTRT)...
WARNING:root:Given dtype that does not have direct mapping to torch (dtype.unknown), defaulting to torch.float
WARNING:root:Given dtype that does not have direct mapping to torch (dtype.unknown), defaulting to torch.float
WARNING: [Torch-TensorRT] - Detected and removing exception in TorchScript IR for node: = prim::If(%387) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:562:8 block0(): %388 : str = aten::format(%318, %386) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:563:29 = prim::RaiseException(%388, %317) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:563:12 -> () block1(): -> ()
WARNING: [Torch-TensorRT] - Detected and removing exception in TorchScript IR for node: = prim::If(%401) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:562:8 block0(): %402 : str = aten::format(%318, %400) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:563:29 = prim::RaiseException(%402, %317) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:563:12 -> () block1(): -> ()
WARNING: [Torch-TensorRT] - Detected and removing exception in TorchScript IR for node: = prim::If(%415) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:562:8 block0(): %416 : str = aten::format(%318, %414) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:563:29 = prim::RaiseException(%416, %317) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:563:12 -> () block1(): -> ()
WARNING: [Torch-TensorRT] - Detected and removing exception in TorchScript IR for node: = prim::If(%429) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:562:8 block0(): %430 : str = aten::format(%318, %428) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:563:29 = prim::RaiseException(%430, %317) # /usr/local/lib/python3.10/dist-packages/torch/nn/modules/batchnorm.py:563:12 -> () block1(): -> ()
WARNING: [Torch-TensorRT] - Conv3d layer with kernel size = 1 configuration incurs a failure with TensorRT tactic optimizer in some cases. Github issue: https://github.com/pytorch/TensorRT/issues/1445. Other conv variants do not have this issue.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - Environment variable NVIDIA_TF32_OVERRIDE=0 but BuilderFlag::kTF32 is set. Disabling TF32.
WARNING: [Torch-TensorRT TorchScript Conversion Context] - Environment variable NVIDIA_TF32_OVERRIDE=0 but BuilderFlag::kTF32 is set. Disabling TF32.
======================================================================
FAIL: test_value_043_cuda (tests.test_hausdorff_distance.TestHausdorffDistance)
device: cuda metric: euclidean directed:False expected: 20.223748416156685
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/parameterized/parameterized.py", line 620, in standalone_func
    return func(*(a + p.args), **p.kwargs, **kw)
  File "/workspace/MONAI/tests/test_hausdorff_distance.py", line 194, in test_value
    np.testing.assert_allclose(expected_value, result.cpu(), rtol=1e-6)
  File "/usr/local/lib/python3.10/dist-packages/numpy/testing/_private/utils.py", line 1592, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/usr/local/lib/python3.10/dist-packages/numpy/testing/_private/utils.py", line 862, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-06, atol=0
Mismatched elements: 1 / 1 (100%)
Max absolute difference: 15.198812
Max relative difference: 3.0246766
x: array(20.223748)
y: array([5.024938], dtype=float32)
======================================================================
FAIL: test_value_078_cuda (tests.test_hausdorff_distance.TestHausdorffDistance)
device: cuda metric: euclidean directed:True expected: 19.924858845171276
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/parameterized/parameterized.py", line 620, in standalone_func
    return func(*(a + p.args), **p.kwargs, **kw)
  File "/workspace/MONAI/tests/test_hausdorff_distance.py", line 194, in test_value
    np.testing.assert_allclose(expected_value, result.cpu(), rtol=1e-6)
  File "/usr/local/lib/python3.10/dist-packages/numpy/testing/_private/utils.py", line 1592, in assert_allclose
    assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/usr/local/lib/python3.10/dist-packages/numpy/testing/_private/utils.py", line 862, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-06, atol=0
Mismatched elements: 1 / 1 (100%)
Max absolute difference: 19.924858
Max relative difference: inf
x: array(19.924859)
y: array([0.], dtype=float32)
KumoLiu commented
The issue started occurring after updating the base image to 24.08.
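Given the issue title, one plausible reading of the wrong Hausdorff values (e.g. `0.` or `5.024938` where ~20 is expected) is that the metric tensor is copied to the host before the CUDA kernels producing it have finished. A minimal sketch of a diagnostic workaround is below; this is an assumption about the root cause, not a confirmed fix, and `metric_to_host` is a hypothetical helper, not part of the MONAI API.

```python
import numpy as np
import torch


def metric_to_host(result: torch.Tensor) -> np.ndarray:
    """Copy a metric result to the host, syncing first if it lives on a GPU.

    torch.Tensor.cpu() should already synchronize on the copy stream, but an
    explicit torch.cuda.synchronize() rules out missed-sync bugs when testing
    against a new base image.
    """
    if result.is_cuda:
        # Block until all queued kernels on this device have completed.
        torch.cuda.synchronize(result.device)
    return result.detach().cpu().numpy()


# Usage mirroring the failing assertion in test_hausdorff_distance.py:
#   np.testing.assert_allclose(expected_value, metric_to_host(result), rtol=1e-6)
```

If the test passes with the explicit sync but fails without it, that would confirm a synchronization regression in the 24.08 image rather than a numerical change in the metric itself.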