openai/glow

Training error when using multiple GPUs

paulchou0309 opened this issue · 0 comments

I ran: mpiexec -n 4 python3 train.py --problem celeba --image_size 256 --n_level 6 --depth 32 --flow_permutation 2 --flow_coupling 0 --seed 0 --learntop --lr 0.001 --n_bits_x 5

Error:
Rank 2 Batch sizes Train 1 Test 1 Init 4
Rank 1 Batch sizes Train 1 Test 1 Init 4
Traceback (most recent call last):
File "train.py", line 413, in <module>
main(hps)
File "train.py", line 145, in main
train_iterator, test_iterator, data_init = get_data(hps, sess)
File "train.py", line 108, in get_data
hps.local_batch_test, hps.local_batch_init, hps.image_size, hps.rnd_crop)
File "/AI/home/zhoujia/Image/glow/data_loaders/get_data.py", line 63, in get_data
train_file = get_tfr_file(data_dir, 'train', int(np.log2(resolution)))
File "/AI/home/zhoujia/Image/glow/data_loaders/get_data.py", line 55, in get_tfr_file
assert len(files) == int(files[0].split(
IndexError: list index out of range
Rank 3 Batch sizes Train 1 Test 1 Init 4
Traceback (most recent call last):
File "train.py", line 413, in <module>
main(hps)
File "train.py", line 145, in main
train_iterator, test_iterator, data_init = get_data(hps, sess)
File "train.py", line 108, in get_data
hps.local_batch_test, hps.local_batch_init, hps.image_size, hps.rnd_crop)
File "/AI/home/zhoujia/Image/glow/data_loaders/get_data.py", line 63, in get_data
train_file = get_tfr_file(data_dir, 'train', int(np.log2(resolution)))
File "/AI/home/zhoujia/Image/glow/data_loaders/get_data.py", line 55, in get_tfr_file
assert len(files) == int(files[0].split(
IndexError: list index out of range
Traceback (most recent call last):
File "train.py", line 413, in <module>
main(hps)
File "train.py", line 145, in main
train_iterator, test_iterator, data_init = get_data(hps, sess)
File "train.py", line 108, in get_data
hps.local_batch_test, hps.local_batch_init, hps.image_size, hps.rnd_crop)
File "/AI/home/zhoujia/Image/glow/data_loaders/get_data.py", line 63, in get_data
train_file = get_tfr_file(data_dir, 'train', int(np.log2(resolution)))
File "/AI/home/zhoujia/Image/glow/data_loaders/get_data.py", line 55, in get_tfr_file
assert len(files) == int(files[0].split(
IndexError: list index out of range
Rank 0 Batch sizes Train 1 Test 1 Init 4

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[45327,1],2]
Exit code: 1

How can I make this work?
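
A note on reading the traceback: the IndexError is raised by files[0] inside get_tfr_file, so the list of matched files is empty on ranks 1 to 3. In other words, the file lookup found nothing for the 'train' split at log2(256) = 8. That points at the data directory rather than at MPI itself. The sketch below is a standalone check, not code from the repository. The helper names find_tfrecords and check_data_dir are hypothetical, and the assumptions that the files live under data_dir/split, end in .tfrecords, and carry a resolution tag such as r08 in their names are guesses that should be compared against the lookup in data_loaders/get_data.py.

```python
import glob
import math
import os


def find_tfrecords(data_dir, split, res_lg2):
    """List .tfrecords files under data_dir/split that mention the resolution tag.

    ASSUMPTION: files sit in <data_dir>/<split>/ and their names contain
    a tag like "r08" for log2(resolution) == 8. Adjust to the real layout.
    """
    tag = "r%02d" % res_lg2
    pattern = os.path.join(data_dir, split, "*.tfrecords")
    return sorted(f for f in glob.glob(pattern) if tag in os.path.basename(f))


def check_data_dir(data_dir, split, image_size):
    """Raise a readable error instead of an IndexError when nothing matches."""
    res_lg2 = int(math.log2(image_size))
    files = find_tfrecords(data_dir, split, res_lg2)
    if not files:
        raise FileNotFoundError(
            "No .tfrecords files for split %r at resolution tag r%02d under %s"
            % (split, res_lg2, os.path.abspath(data_dir))
        )
    return files
```

Running check_data_dir(your_data_dir, 'train', 256) on the machine that runs the MPI ranks should show whether the files are visible there. If it raises, check that the dataset was downloaded and unpacked for that resolution, and that every rank is pointed at the same directory.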