openai/roboschool

roboschool affects pytorch.multiprocessing

ShangtongZhang opened this issue · 1 comments

I'm using OS X 10.12. My PyTorch version is 0.2.0_3, and my Python version is
3.6.3 (default, Oct 4 2017, 06:09:15)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.37)]

See the following minimal example:

import torch
import torch.nn as nn
from torch.autograd import Variable
import numpy as np
import torch.multiprocessing as mp
import time
import sys
import roboschool

print(torch.__version__)
print(sys.version)

# batch_size = 64
batch_size = 128

def train(id):
    while True:
        fc = nn.Linear(5, 100)
        x = Variable(torch.FloatTensor(np.random.rand(batch_size, 5)), volatile=True)
        y = fc(x)

num_workers = 8
ps = [mp.Process(target=train, args=(i, )) for i in range(num_workers)]
for p in ps: p.start()
while True:
    time.sleep(1)
    for i, p in enumerate(ps):
        if not p.is_alive():
            print('Worker %d exited unexpectedly.' % i)
            p.terminate()
            ps[i] = mp.Process(target=train, args=(i, ))
            ps[i].start()
            print('Worker %d restarted.' % i)
            break
for p in ps: p.join()

It will output:

0.2.0_3
3.6.3 (default, Oct  4 2017, 06:09:15) 
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.37)]
Worker 0 exited unexpectedly.
Worker 0 restarted.
Worker 0 exited unexpectedly.
Worker 0 restarted.
Worker 0 exited unexpectedly.
Worker 0 restarted.

If I change the batch size to 64, then it works well.
If I don't import roboschool, both 64 and 128 work well.

I noticed a similar issue in #53,
but mp.set_start_method('spawn') doesn't work for me.
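
To narrow down why the workers die, it may help to print each dead worker's exit code in the monitoring loop. The sketch below is only a diagnostic aid, not a fix. The helper name describe_exitcode is hypothetical. It relies on the documented behaviour of Process.exitcode, which is None while the process is running and -N when the process was terminated by signal N.

```python
import signal

def describe_exitcode(exitcode):
    """Turn a Process.exitcode value into a readable message.

    Hypothetical helper for debugging; not part of PyTorch or roboschool.
    """
    if exitcode is None:
        return 'still running'
    if exitcode == 0:
        return 'exited normally'
    if exitcode < 0:
        # A negative exit code -N means the child was killed by signal N.
        try:
            name = signal.Signals(-exitcode).name
        except ValueError:
            name = 'unknown signal'
        return 'killed by signal %d (%s)' % (-exitcode, name)
    return 'exited with status %d' % exitcode
```

In the loop above, this could be called as print('Worker %d %s' % (i, describe_exitcode(p.exitcode))) right after the is_alive() check. A signal name such as SIGSEGV or SIGABRT would suggest a crash in native code rather than a Python exception.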

I have no idea. Sometimes esoteric things happen. Yes, as in the issue you linked, it can be caused by other libraries or by the order in which libraries are loaded. Let me know if you find a workaround that works for most people.