open-mmlab/mmengine

[Bug] Performance issue: the first call to torch.Tensor.item() in _get_valid_value takes too long


Prerequisite

Environment

OrderedDict([
    ('sys.platform', 'win32'),
    ('Python', '3.10.13 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:24:38) [MSC v.1916 64 bit (AMD64)]'),
    ('CUDA available', True),
    ('MUSA available', False),
    ('numpy_random_seed', 2147483648),
    ('GPU 0', 'NVIDIA GeForce RTX 3070'),
    ('CUDA_HOME', 'C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6'),
    ('NVCC', 'Cuda compilation tools, release 11.6, V11.6.55'),
    ('MSVC', 'Microsoft (R) C/C++ Optimizing Compiler Version 19.38.33130 for x64'),
    ('GCC', 'n/a'),
    ('PyTorch', '1.13.1+cu116'),
    ('PyTorch compiling details', 'PyTorch built with:\n - C++ Version: 199711\n - MSVC 192829337\n - Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications\n - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)\n - OpenMP 2019\n - LAPACK is enabled (usually provided by MKL)\n - CPU capability usage: AVX2\n - CUDA Runtime 11.6\n - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37\n - CuDNN 8.3.2 (built against CUDA 11.5)\n - Magma 2.5.4\n - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.6, CUDNN_VERSION=8.3.2, CXX_COMPILER=C:/actions-runner/_work/pytorch/pytorch/builder/windows/tmp_bin/sccache-cl.exe, CXX_FLAGS=/DWIN32 /D_WINDOWS /GR /EHsc /w /bigobj -DUSE_PTHREADPOOL -openmp:experimental -IC:/actions-runner/_work/pytorch/pytorch/builder/windows/mkl/include -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.13.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=OFF, USE_NNPACK=OFF, USE_OPENMP=ON, USE_ROCM=OFF, \n'),
    ('TorchVision', '0.14.1+cu116'),
    ('OpenCV', '4.9.0'),
    ('MMEngine', '0.10.3')])

Reproduces the problem - code sample

def _get_valid_value(
    self,
    value: Union['torch.Tensor', np.ndarray, np.number, int, float],
) -> Union[int, float]:
    """Convert value to python built-in type.

    Args:
        value (torch.Tensor or np.ndarray or np.number or int or float):
            value of log.

    Returns:
        float or int: python built-in type value.
    """
    import time
    s = time.time()
    if isinstance(value, (np.ndarray, np.number)):
        assert value.size == 1
        value = value.item()
    elif isinstance(value, (int, float)):
        value = value
    else:
        # check whether value is torch.Tensor but don't want
        # to import torch in this file
        assert hasattr(value, 'numel') and value.numel() == 1
        value = value.item()
    print(f"get_valid_value use {time.time() - s}")
    return value  # type: ignore
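To make the snippet above runnable outside its `MessageHub` class, here is a standalone sketch of the same conversion logic. `FakeTensor` is a hypothetical stand-in for a one-element `torch.Tensor` (it only mimics `numel()` and `item()`), so the duck-typing branch can be demonstrated without importing torch:

```python
from typing import Union

import numpy as np


class FakeTensor:
    """Hypothetical stand-in for a 1-element torch.Tensor (illustration only)."""

    def __init__(self, v: float) -> None:
        self._v = v

    def numel(self) -> int:
        return 1

    def item(self) -> float:
        return self._v


def get_valid_value(value) -> Union[int, float]:
    """Same branching as MessageHub._get_valid_value, minus the timing code."""
    if isinstance(value, (np.ndarray, np.number)):
        assert value.size == 1
        value = value.item()
    elif isinstance(value, (int, float)):
        pass  # already a Python built-in scalar
    else:
        # Duck-typed torch.Tensor check: avoids importing torch in this module.
        assert hasattr(value, 'numel') and value.numel() == 1
        value = value.item()
    return value
```

`get_valid_value(np.array([3.5]))`, `get_valid_value(7)`, and `get_valid_value(FakeTensor(2.0))` all come back as plain Python scalars.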

In mmengine/logging/message_hub.py, the _get_valid_value function shows a performance problem in its torch.Tensor.item() conversion when called from the after_train_iter method of the runtime info hook. My measurements show that the first call takes significant time while subsequent calls measure as zero; because after_train_iter fires on every training iteration, the accumulated overhead in the training loop becomes severe.

Steps to reproduce

  1. Run a training loop that triggers after_train_iter.
  2. Observe the time spent in the torch.Tensor.item() call inside _get_valid_value.

Output of the measured call times:

get_valid_value use 0.02899909019470215
get_valid_value use 0.0
get_valid_value use 0.0

Units: seconds
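The zero readings after the first call may be a measurement artifact rather than caching. This is my own interpretation, not confirmed in the issue: CUDA kernels launch asynchronously and .item() blocks until the device catches up, so the first timed .item() absorbs the latency of all previously queued GPU work. A sketch of a fairer measurement (falls back to plain timing on CPU):

```python
import time

import torch


def timed_item(t: torch.Tensor) -> tuple:
    """Time .item() alone by first draining queued GPU work,
    so the latency of earlier kernels is not billed to the conversion."""
    if t.is_cuda:
        torch.cuda.synchronize()  # wait for pending kernels before starting the clock
    start = time.time()
    value = t.item()
    return value, time.time() - start


# On CPU this runs anywhere; on CUDA the synchronize() call is what
# separates .item()'s own cost from the backlog of queued kernels.
value, elapsed = timed_item(torch.tensor(3.0))
```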

Expected behavior

I expect the torch.Tensor.item() call not to introduce such a significant delay on its first invocation.

Reproduces the problem - command or script

No comment

Reproduces the problem - error message

No comment

Additional information

To address this performance problem, I suggest modifying the parse_losses function in base_model so that it performs the type conversion up front, turning loss values, accuracies, and similar metrics into scalars, thereby avoiding the expensive torch.Tensor.item() call inside _get_valid_value. A possible solution sketch:

# Modified parse_losses function (sketch)
def parse_losses(
    self, losses: Dict[str, torch.Tensor]
) -> Tuple[torch.Tensor, Dict[str, float]]:
    # ... keep the original code ...
    # log_vars here is the list of [name, tensor] pairs built above;
    # calling .item() converts each entry to a Python float once, up front.
    log_vars = [[key, value.mean().item()] for key, value in log_vars]
    # ... keep the original code ...
    return loss, log_vars  # type: ignore
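The suggestion above can be fleshed out into a self-contained sketch. This is my reconstruction under assumptions about the surrounding mmengine code (the real parse_losses also handles lists of tensors and raises on unsupported types), not the actual implementation:

```python
from collections import OrderedDict
from typing import Dict, Tuple

import torch


def parse_losses(
    losses: Dict[str, torch.Tensor],
) -> Tuple[torch.Tensor, Dict[str, float]]:
    """Sum entries whose key contains 'loss' and log everything as floats.

    The returned `loss` stays a tensor so backward() still works; the
    returned log dict holds plain Python floats, so downstream logging
    (e.g. MessageHub._get_valid_value) never has to call .item() again.
    """
    # Reduce every logged tensor to a scalar tensor first.
    log_vars = [(key, value.mean()) for key, value in losses.items()]
    # Total loss: sum of all entries whose name contains 'loss'.
    loss = sum(value for key, value in log_vars if 'loss' in key)
    log_vars.insert(0, ('loss', loss))
    # Convert once, here, so the hot logging path sees only built-ins.
    scalar_log_vars = OrderedDict((key, value.item()) for key, value in log_vars)
    return loss, scalar_log_vars
```

For example, `parse_losses({'loss_cls': torch.tensor([1.0, 3.0]), 'acc': torch.tensor(0.5)})` returns the loss tensor plus a dict of floats for 'loss', 'loss_cls', and 'acc'.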