[DIPU] mock_cuda设为true后，torch.device导致guard检查失败造成重新编译的问题。

背景

编译模式下，在燧原和华为机器上运行llama_finetune。

问题描述

在mock_cuda为true的情况下，若在模型运行前调用torch.set_default_device，分析timeline发现每次都会触发dynamo的compile，查看日志调查发现：

torch._dynamo.convert_frame.__recompiles: [DEBUG] ('Recompiling function _is_fp16_bf16_tensor in /home/cse/zhousl/accelerate/src/accelerate/utils/operations.py', "triggered by the following guard failure: utils_device.CURRENT_DEVICE == device(type='cuda', index=0)")

是utils_device.CURRENT_DEVICE == device(type='cuda', index=0) 检查失败导致每次都会重新编译。

分析发现：

调用torch.set_default_device时，会触发torch中的这段代码对CURRENT_DEVICE进行设置，这时候用到的torch.device为torch原本的api。
https://github.com/DeepLink-org/pytorch/blob/eb31e39bb07f00cf1c917a3bc83867981b2c04cd/torch/utils/_device.py#L57C1-L65C35

而在运行过程中对utils_device.CURRENT_DEVICE == device(type='cuda', index=0) 做检查时，调用的是dipu mock后的torch.device api，mock前后的torch.device调用结果并不一致。

deeplink.framework/dipu/torch_dipu/dipu/device.py

Lines 38 to 67 in 3ecb00b

    
           class _DIPUDevice(metaclass=_MetaDeviceType): 
        
               @staticmethod 
        
               def __replacedipu(arg): 
        
                   if (__dipu__ in arg): 
        
                       arg = arg.replace(__dipu__, __dipu_device_type__) 
        
                   if (mockcuda and "cuda" in arg): 
        
                       arg = arg.replace("cuda", __dipu_device_type__) 
        
                   return arg 
        
               def __new__(cls, *args, **kwargs): 
        
                   if len(args) == 1 and isinstance(args[0], int) and mockcuda: 
        
                       # modify default int device type only when "mock cuda". 
        
                       dev_name = __dipu_device_type__ + ":" + str(args[0]) 
        
                       _device = _MetaDeviceType._torch_device(dev_name) 
        
                       return _device 
        
                   # handle device as str 
        
                   if len(args) >= 1 and isinstance(args[0], str): 
        
                       argList = list(args) 
        
                       argList[0] = cls.__replacedipu(args[0]) 
        
                       args = tuple(argList) 
        
                   # handle parameter type: str, not support int type but str and device 
        
                   deviceValue = kwargs.get("type", None) 
        
                   if deviceValue != None and isinstance(deviceValue, str): 
        
                       kwargs["type"] = cls.__replacedipu(deviceValue) 
        
                   _device = _MetaDeviceType._torch_device(*args, **kwargs) 
        
                   return _device 
        
           # always patch: device class is immutable, cannot directly patch __new__ method on python layer. 
        
           torch.device = _DIPUDevice

简单来说就是torch_dipu中缺少torch中从torch.set_default_device到CURRENT_DEVICE 的完整逻辑链路，且torch.device mock前后在做guards检查时并不一致。
可通过如下代码复现：
'import torch
t1 = torch.device(type='cuda', index=0)
import torch_dipu
t2 = torch.device(type='cuda', index=0)
print(t1 == t2) # 结果为False'
目前可通过不调用torch.set_default_device，或者在import torch之前先import torch_dipu(这样可以在设置CURRENT_DEVICE前就让dipu mock掉torch.device api)来避免guards检查失败导致重编这个问题。

	class _DIPUDevice(metaclass=_MetaDeviceType):
	@staticmethod
	def __replacedipu(arg):
	if (__dipu__ in arg):
	arg = arg.replace(__dipu__, __dipu_device_type__)
	if (mockcuda and "cuda" in arg):
	arg = arg.replace("cuda", __dipu_device_type__)
	return arg

	def __new__(cls, args, *kwargs):
	if len(args) == 1 and isinstance(args[0], int) and mockcuda:
	# modify default int device type only when "mock cuda".
	dev_name = __dipu_device_type__ + ":" + str(args[0])
	_device = _MetaDeviceType._torch_device(dev_name)
	return _device
	# handle device as str
	if len(args) >= 1 and isinstance(args[0], str):
	argList = list(args)
	argList[0] = cls.__replacedipu(args[0])
	args = tuple(argList)
	# handle parameter type: str, not support int type but str and device
	deviceValue = kwargs.get("type", None)
	if deviceValue != None and isinstance(deviceValue, str):
	kwargs["type"] = cls.__replacedipu(deviceValue)
	_device = _MetaDeviceType._torch_device(args, *kwargs)
	return _device


	# always patch: device class is immutable, cannot directly patch __new__ method on python layer.
	torch.device = _DIPUDevice