On Windows, a tensor gets overwritten when using train_agent_multiprocessing

Question

Opened this issue a year ago · 1 comment

My environment:
OS: Windows 10
CUDA: 11.8
cuDNN: 8.8.1
Python: 3.9.13
PyTorch: 2.0.0
ElegantRL: latest development version

Description:
When using train_agent_multiprocessing, the Learner process sends the actor to the Worker processes through pipes, here:

'''Learner send actor to Workers'''
for send_pipe in self.send_pipes:
    send_pipe.send(agent.act)

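After this, state_std, a member of agent.act, changes from
tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], device='cuda:0')
to
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')
That is, all ones become all zeros.
After debugging with pdb, I traced the problem to:
Lib\multiprocessing\reduction.py in the Python standard library. The class ForkingPickler(pickle.Pickler) in that file looks odd to me. It is a subclass of pickle.Pickler and has this class method: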

@classmethod
def dumps(cls, obj, protocol=None):
    buf = io.BytesIO()
    cls(buf, protocol).dump(obj)
    return buf.getbuffer()

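cls(buf, protocol) is a ForkingPickler object, but ForkingPickler does not define dump, so by inheritance it should use pickle.Pickler's dump. However, a breakpoint I set inside pickle.Pickler's dump was never hit, and printing cls(buf, protocol).dump shows that it is a built-in method. Once cls(buf, protocol).dump(obj) has run, obj's state_std member has been overwritten.
There is not much information online about this built-in cls(buf, protocol).dump method. I would appreciate any pointers from people who know CPython well.
I also read up on how Python multiprocessing differs across operating systems, for example this post:
https://www.pythonforthelab.com/blog/differences-between-multiprocessing-windows-and-linux/
Based on that post, I wrote a small test program: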
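On why the breakpoint is never hit: on CPython, pickle.Pickler is normally the C implementation from the _pickle extension module (the pure-Python version lives at pickle._Pickler), so dump shows up as a built-in method and pdb cannot step into it. ForkingPickler adds per-type reducers through its register class method, and as far as I know torch.multiprocessing uses that same hook to register its own tensor and storage reducers, which would be where CUDA tensors get special handling. The sketch below uses only the standard library. Payload and reduce_payload are hypothetical names made up for illustration, not anything from PyTorch or ElegantRL.

```python
import io
import pickle
from multiprocessing.reduction import ForkingPickler


class Payload:
    """Hypothetical stand-in for an object that has a custom reducer."""

    def __init__(self, values):
        self.values = values


calls = []  # records every time the custom reducer runs


def reduce_payload(obj):
    """Custom reducer: tells the pickler how to rebuild the object elsewhere."""
    calls.append(type(obj).__name__)
    # Rebuild as a plain dict on the receiving side (illustration only).
    return dict, ({"values": list(obj.values)},)


# Same registration hook that libraries can use to customise pickling.
ForkingPickler.register(Payload, reduce_payload)

# Where the Pickler base class comes from, and what kind of object dump is.
pickler_module = pickle.Pickler.__module__
dump_kind = type(ForkingPickler(io.BytesIO()).dump).__name__
print("pickle.Pickler is defined in:", pickler_module)
print("ForkingPickler(...).dump is a:", dump_kind)

# ForkingPickler.dumps goes through the registered reducer.
data = ForkingPickler.dumps(Payload([1.0, 1.0]))
restored = pickle.loads(data)
print("reducer calls:", calls)
print("restored:", restored)
```

To see which reducer handles a tensor, it may be more productive to set breakpoints in the Python-level reducer functions that a library registers than in dump itself.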

import os
import torch
import torch.nn as nn
import torch.multiprocessing as mp  # torch.multiprocessing extends Python's multiprocessing

if os.name == 'nt':  # Windows (Windows NT)
    # Work around an Anaconda issue on Windows:
    # OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized.
    os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.state_std = nn.Parameter(torch.ones((10,)), requires_grad=False)

class MyClass:
    def __init__(self, gpu_id):
        device = torch.device(f"cuda:{gpu_id}" if (torch.cuda.is_available() and (gpu_id >= 0)) else "cpu")
        self.my_module = MyModule().to(device)
        print(f'state_std in init: {self.my_module.state_std}')

    def simple_method(self):
        print('This is a simple method')
        print(f'state_std in simple_method: {self.my_module.state_std}')

    def mp_simple_method(self):
        self.p = mp.Process(target=self.simple_method)
        self.p.start()

    def wait(self):
        self.p.join()
        print(f'state_std in wait: {self.my_module.state_std}')


if __name__ == '__main__':
    # Don't use method='fork' when sending tensors that live on the GPU
    method = 'spawn' if os.name == 'nt' else 'forkserver'  # os.name == 'nt' means Windows
    mp.set_start_method(method=method, force=True)

    my_class = MyClass(0)
    my_class.mp_simple_method()
    my_class.wait()

With the CPU as the device, state_std is not modified. With CUDA, it is modified on Windows but not on Linux.
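Since the CPU path works, one possible workaround is to send a CPU copy of the actor's state through the pipe instead of the CUDA module itself, and let each worker load it into its own actor. As far as I recall, PyTorch's Windows FAQ says CUDA IPC operations are not supported on Windows and recommends sharing CPU tensors instead. The sketch below is an assumption about how this could look, not a tested fix for ElegantRL. TinyActor and cpu_state_dict are hypothetical names, and the example runs on the CPU only.

```python
import torch
import torch.nn as nn


def cpu_state_dict(module: nn.Module) -> dict:
    """Return a detached CPU copy of every parameter and buffer in `module`."""
    return {
        name: tensor.detach().cpu().clone()
        for name, tensor in module.state_dict().items()
    }


class TinyActor(nn.Module):
    """Hypothetical stand-in for agent.act with a state_std member."""

    def __init__(self):
        super().__init__()
        self.state_std = nn.Parameter(torch.ones(10), requires_grad=False)


# Learner side: build the payload that would be sent through the pipe.
learner_actor = TinyActor()
payload = cpu_state_dict(learner_actor)

# Worker side: a separately constructed actor with different values.
worker_actor = TinyActor()
with torch.no_grad():
    worker_actor.state_std.zero_()

# Loading the payload restores the Learner's values in the Worker's actor.
worker_actor.load_state_dict(payload)
print(worker_actor.state_std)
```

In the Learner loop this would amount to send_pipe.send(cpu_state_dict(agent.act)), with each Worker calling load_state_dict on its own actor and moving it to whichever device it needs. I have not checked whether that fits ElegantRL's Worker code.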

Answer 1 · 2023-04-28T03:02:33.000Z

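Thanks for opening this issue. I did not know this could happen. I will probably need to find a Windows machine that can run PyTorch and try to reproduce the problem you ran into.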

method = 'spawn' if os.name == 'nt' else 'forkserver' # os.name == 'nt' means Windows NT operating system (WinOS)

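We added this line earlier precisely because we had found that PyTorch multiprocessing had problems on Windows.
It looks like PyTorch multiprocessing has other pitfalls on Windows as well.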
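The start-method rule quoted above can also be applied through a context object rather than the process-wide set_start_method(force=True). This is a minimal sketch using only the standard library. pick_start_method is a hypothetical helper name, and while I believe torch.multiprocessing exposes the same get_context call, I have not verified this on Windows.

```python
import multiprocessing as mp
import os


def pick_start_method() -> str:
    """Mirror the rule quoted above: 'spawn' on Windows, 'forkserver' elsewhere."""
    preferred = "spawn" if os.name == "nt" else "forkserver"
    available = mp.get_all_start_methods()
    # 'spawn' is the fallback if the preferred method is not offered here.
    return preferred if preferred in available else "spawn"


method = pick_start_method()
ctx = mp.get_context(method)  # context object; the global default is left untouched
print("available start methods:", mp.get_all_start_methods())
print("chosen start method:", ctx.get_start_method())
```

Processes and pipes would then be created with ctx.Process(...) and ctx.Pipe(), so the choice stays local to the training code.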